ABSTRACT


The Naval Postgraduate School's Student Opinion Form
data were subjected to study through the use of two cluster 
analysis techniques: (1) K-MEANS partitioning method and 
(2) Chernoff's FACES. Much developmental work was performed 
to tailor these methods to the special requirements of the 
data set. A thorough multivariate statistical review pro- 
vided the basis for choosing optimality criteria and distance 
functions for use in the MIKCA (Multivariate Iterative K-MEANS 
Clustering Algorithm). Alterations were made to the computer 
code to allow the analysis to include the effect of class 
size on cluster membership. Use of the linear discriminant 
function aided in identifying variables for use in constructing 
features of the computer-drawn faces. This approach to the 
Chernoff's FACES technique shows promise but needs further 
development. A principal components analysis of the data
showed it to be essentially one dimensional. Partitioning
the data into four clusters shows that the scoring of the
courses varies inversely with class size.
I. INTRODUCTION


The Student Opinion Form (SOF) used at the Naval Post- 
graduate School provides an organized information gathering 
mechanism about each course (and its instructor) as per- 
ceived by the students. The information obtained from the 
SOF data is used for administrative review of faculty 
performance and for feedback to the instructor to aid in 
self-development. The former use is hampered by the fact 
that the data are multivariate in nature and represent a 
complicated set of interactions between the instructor's 
performance, the nature of the course, and the group of
students. There is a need for methodology which can
disentangle those interactions and summarize the data in a
meaningful way.

It is the purpose of this thesis to develop suitable 
cluster analysis methods for studying the data and dis- 
covering any hidden structure they may possess. Concurrently, 
a certain amount of exploratory data analysis took place, 
and those results are reported also. 

At the completion of every quarter, students are requested
to respond to a 16-item SOF questionnaire for each course
in which they are enrolled. The data are viewed as an n by p
matrix, representing n observations (SOFs), each of which is
measured on p (16) different variables. For this research
the mean vector of each course was computed. Then attempts
were made to discover natural clusters of these mean vectors
which in turn can be interpreted as the underlying structure
in the data. Since the number of students per course is
quite variable, the mean vectors are not equally well
determined. Also, the matrix of mean vectors may have a
covariance structure quite different from that of the full
n by p data matrix.

The clustering objective was pursued by two multivariate
statistical methods: one computer-graphic technique referred
to as Chernoff's FACES, and a second, more mathematically
oriented approach called K-MEANS. The former produces
computer-drawn cartoon faces, the features of which are
controlled by variables in the data. The assignment of
variables to features was aided by the use of linear
discriminant analysis. One face is produced from each course
mean vector, and then the researcher is able to study the
appearance of the faces and cluster together those that
display similar characteristics. The second method utilizes
a computer program called MIKCA (Multivariate Iterative
K-MEANS Clustering Algorithm) which is based on the K-MEANS
method. It forms an initial partition of the data and then
transfers observations between clusters in order to improve
an optimality criterion function. In this iterative manner,
MIKCA ultimately stabilizes and provides an "optimal" cluster
solution.

In addition, a modified MIKCA technique was employed.
Alterations were made to the basic computer code to enable
the program to weight each mean vector by the number of
students in the course. This modification may be likened
to a one-way Analysis of Variance (ANOVA) having unbalance
in the number of observations per treatment. The result is
to stabilize the relative variability of the various course
mean vectors.

Most multivariate analysis methodology is derived assuming
the data have a multivariate normal distribution with a common
covariance matrix. The performance of the MIKCA program
and the linear discriminant analysis will not depend greatly
upon these assumptions provided the clusters are well defined.
On the other hand, if the clusters are not well separated,
the results of the programs will be sensitive to these
assumptions, and this is the condition anticipated. Accordingly,
a transformation was sought which would bring the data closer
to satisfying them. The one selected is essentially a logistic
function.

It is frequently necessary to compare the agreement of
cluster solutions produced under different conditions or by
different methods. For this purpose, a computer program
was written which provides an ad hoc measure of the amount
of agreement between the results of two or more solutions.
A number between zero and one, called the comparison
coefficient, is the resulting measure of association.

This thesis was largely exploratory and should serve
as a firm foundation for future study of the SOF data in
particular and similarly structured multivariate data sets
in general. A number of unexpected questions were raised
during the exploratory phases of this research. It was
not possible to answer many of these questions, and their
consideration is left to other researchers. During the
development of the methodologies, some new and challenging
problems were encountered. Many of these had to be given
rather short treatment in the interest of meeting the original
objectives. It should be emphasized that although some very
interesting facts are revealed in this thesis, the results
are by no means considered to describe completely the
information hidden in the data.
II. CLUSTER ANALYSIS


A. ORIGIN AND THEORY 

Cluster analysis is the name given to a body of diverse 
techniques for discovering taxonomical structure within bodies 
of data. It is one of several methodologies included in 
the broader category called classification. In cluster 
analysis little or nothing is known about the category
structure. All that is available is a collection of
observations whose category memberships are unknown, and one
must discover a category structure which fits the observations.
The objective is to find the natural groups by sorting the
observations such that the association is high among members
of the same group and low between members of different groups.
The great challenge to the researcher is finding the most
appropriate way of defining "natural groups" and "association."
Cluster analysis is closely related to and often confused
with discriminant analysis, a statistical procedure for
assigning new observations to known groups. In contrast
to discriminant analysis, clustering refers to discovery of
the initial groups.

Although modern clustering techniques began development
in biological taxonomy, they are generally applicable to
all types of data. Any method which partitions a set of
objects into subsets on the basis of measurements taken on
every object qualifies as a clustering method. Cluster
analysis techniques are most often applied in multivariate
settings, that is, where each of n observations is measured
on p different variables. A clear intuitive picture of the
concept is helpful in appreciating the value of cluster
analysis and the situations to which it might be applied.
In a geometric sense, every object (observation) may be
viewed as a point in p-dimensional Euclidean space. This
swarm of data points may contain dense regions or "clouds"
of data points which are separable from other regions
containing a low density of points. These denser regions
constitute what are known as clusters. In the one and two
dimensional cases, it is easy for the human eye to quickly
detect the clusters from scatter plots, assuming that the
clusters exist. In higher dimensions, clustering attempts
become extremely difficult without the aid of computers.
Solutions to the clustering problem usually involve the
determination of a partition which satisfies some optimality
criterion. The optimality criterion is a way of measuring
how good a particular cluster solution is relative to other
solutions. An astounding number of possible solutions exist.
Reference 1 describes a Stirling number of the second kind
representing the number of ways n objects may be sorted
into m groups:

    $$S(n, m) = \frac{1}{m!} \sum_{k=0}^{m} (-1)^{m-k} \binom{m}{k} k^{n}$$

The number of groups is usually unknown so the problem is
compounded, and the total number of possibilities is a sum
of Stirling numbers. In the case of 25 observations, the
total number of possible cluster solutions is

    $$\sum_{m=1}^{25} S(25, m)$$

which exceeds $4 \times 10^{18}$. This illustrates that the
enumerative technique for finding solutions can require huge
amounts of computer time, and there exists a need for a
better way.
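The size of this search space can be checked numerically. The following is a minimal sketch added for illustration, not code from Reference 1; the function names `stirling2` and `total_partitions` are hypothetical, and the recurrence used is the standard one for Stirling numbers of the second kind.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, m):
    """Number of ways to sort n labelled objects into m non-empty groups."""
    if n == m:
        return 1  # includes the convention S(0, 0) = 1
    if m == 0 or m > n:
        return 0
    # The n-th object either joins one of the m existing groups
    # or starts a new group by itself.
    return m * stirling2(n - 1, m) + stirling2(n - 1, m - 1)


def total_partitions(n):
    """Total number of cluster solutions when the number of groups is unknown."""
    return sum(stirling2(n, m) for m in range(1, n + 1))


count_25 = total_partitions(25)
print(count_25)
```

Evaluating every one of these partitions against a criterion is clearly impractical, which motivates the iterative methods discussed later.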

Modern techniques allow solutions to be found without
evaluating the criterion for each and every solution.
However, the need for ranking solutions is evident, and the
criterion function serves to meet this need. A wide variety
of such functions exists, and the choice is usually determined
by the particular characteristics of the research being
conducted. A more detailed discussion of optimality criteria
is presented in Section II.C.

Mathematical clustering techniques usually call for a
concept of distance between objects. In order to solve the
cluster problem, it is desirable to define the terms
"similarity" and "difference" in a quantitative fashion. What
does it mean to say two objects are different? Perhaps an
investigator would assign two observations to the same group
if the distance between them is sufficiently small, or to
different clusters if this distance is sufficiently large.
Common reference to the closeness of objects is made in
units of length, weight, or time. Numerous methods for
measuring distance will be discussed in Section II.D.

In the following discussion, $X_i$ and $X_j$ represent two
points in p-dimensional Euclidean space ($E_p$) corresponding
to objects or observations. Any non-negative real-valued
function $D(X_i, X_j)$ satisfying the following conditions
qualifies as a distance function (or metric).

    a. $D(X_i, X_j) \ge 0$ for all $X_i$ and $X_j$ in $E_p$

    b. $D(X_i, X_j) = 0$ if and only if $X_i = X_j$

    c. $D(X_i, X_j) = D(X_j, X_i)$

    d. $D(X_i, X_j) \le D(X_i, X_k) + D(X_k, X_j)$

where $X_i$, $X_j$, and $X_k$ are any three points in $E_p$. Later
discussions will place particular emphasis on the Mahalanobis
metric.

Cluster analysis is applicable in nearly every field of
study. The literature is both voluminous and diverse, the
terminology differing from one field to another.
"Numerical taxonomy" is frequently substituted for cluster
analysis among biologists, botanists, and ecologists, while
some social scientists may prefer "typology." Other
frequently encountered terms are pattern recognition and
partitioning. While discriminant analysis has been studied by
statisticians for nearly 45 years, cluster analysis has only
recently come to statistical notice.

Cluster analysis is an exploratory device, a tool for
suggestion and discovery. A question often asked is "How do
you know when you have a good set of clusters?" The answer
is that the clusters themselves are not interesting; the
point of interest is in inference about the structure of
the data. The clusters do not explain the structure; they
are consequences of the structure. The explanatory
structure is the object of the search and its description is
in terms of principles and ideas, not individual data units.
It is important to realize that a given set of data
may contain no "right" classification, but possibly many
different, meaningful classifications. It could be the
case that the data contain no clusters at all.


B. SCATTER MATRIX DECOMPOSITION 

Described in this section are the multivariate terminology
and notation to be used in this thesis. The literature
contains as many different notational structures as authors.
The emphasis here is on simplicity, while also exposing the
reader to some of the more common terminology.

In general, multivariate data are viewed as an n by p
matrix referred to as X. It represents n observations, each
of which consists of measurements on p different variables.
The cross products matrix is analogous to the univariate sum
of squared deviations from the mean and is represented by
the p by p matrix T:

    $$T = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (X_{ij} - \bar{X})(X_{ij} - \bar{X})'$$

where

    $X_{ij}$ is the j-th observation vector in the i-th group.

    $\bar{X}$ is the grand mean vector of the data.

    g is the number of groups.

    $n_i$ is the number of observations in the i-th group.

    Prime notation indicates transpose.

All vectors are column vectors. Cross product matrices
are also referred to as scatter matrices. Division of T
by n-1 (where n represents the total number of observations)
yields the total variance-covariance matrix, sometimes referred
to as a dispersion matrix.

The total sum of squares (cross products) matrix may
be expressed as the sum of the within-group and the
between-group scatter matrices:

    $$T = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)(X_{ij} - \bar{X}_i)' + \sum_{i=1}^{g} n_i (\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})' = W + B$$

where $\bar{X}_i$ is the mean vector of the i-th group. Each
individual group has its own scatter matrix $W_i$, and W is
the sum of these matrices:

    $$W = \sum_{i=1}^{g} W_i$$


This discussion is intended to be completely general,
with no particular group structure in mind. Later we
shall explore the differences in the two group structures
represented by the SOF data.

    (1) Individual SOFs are considered to be the observations
        and the courses are the groups.

    (2) The course mean vectors are the observations and
        the clusters of courses are the groups.

These two group structures are different ways of viewing
the data; their relationship shall be explained in Section
III.
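The decomposition of the total scatter into within-group and between-group parts can be verified on a small example. The sketch below is illustrative only: the data values and the helper names (`mean_vector`, `scatter`, `decompose`) are assumptions introduced here, not part of the thesis programs.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter(rows, center):
    """Sum over rows of the outer product (x - center)(x - center)'."""
    p = len(center)
    s = [[0.0] * p for _ in range(p)]
    for x in rows:
        d = [xi - ci for xi, ci in zip(x, center)]
        for a in range(p):
            for b in range(p):
                s[a][b] += d[a] * d[b]
    return s


def decompose(groups):
    """Return the total, within-group and between-group scatter matrices."""
    all_rows = [x for g in groups for x in g]
    grand = mean_vector(all_rows)
    p = len(grand)
    T = scatter(all_rows, grand)
    W = [[0.0] * p for _ in range(p)]
    B = [[0.0] * p for _ in range(p)]
    for g in groups:
        m = mean_vector(g)
        Wi = scatter(g, m)        # scatter of the group about its own mean
        Bi = scatter([m], grand)  # outer product of (group mean - grand mean)
        for a in range(p):
            for b in range(p):
                W[a][b] += Wi[a][b]
                B[a][b] += len(g) * Bi[a][b]
    return T, W, B


# Two made-up groups of two-dimensional observations (hypothetical values).
groups = [
    [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]],
    [[6.0, 7.0], [8.0, 6.0]],
]
T, W, B = decompose(groups)
print(T)
print(W)
print(B)
```

Note that the between-group term weights each group mean by the number of observations in the group, which is why unequal class sizes matter when course means are treated as the observations.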


C. OPTIMALITY CRITERIA 

Most of the well known clustering techniques fall into
one of two main categories: (1) hierarchical and (2)
partitioning. The former class is one in which every cluster
obtained at any stage is a merger of clusters at previous
stages. The non-hierarchical procedures, however, form new
clusters by lumping and splitting old ones.

Partitioning methods were used in this research. The
main idea is to choose some initial partition and then alter
the cluster membership in an effort to improve the partition.
Different interpretations of what constitutes a "better"
partition and numerous ways of achieving this improvement
have led to a great variety of algorithms. These methods
are related to the steepest descent algorithms used for
unconstrained optimization in nonlinear programming. Such
algorithms begin with an initial point and then converge
to a local optimum, moving one step at a time, the value of
the objective function improving at each step. A well known
example is the ISODATA procedure developed by Ball and Hall
at Stanford Research Institute. Chapter IV discusses a
partitioning method known as K-MEANS which was developed
by MacQueen [2]. He uses the term "K-MEANS" to denote the
process of assigning each data unit to that cluster (of
k clusters) with the nearest centroid (mean vector). The
cluster centroids change with each transfer of an observation.
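The assignment process just described can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the MIKCA program or MacQueen's exact procedure: the first k points are taken as seeds, ordinary Euclidean distance is used, a transfer is refused if it would empty a cluster, and the names `k_means`, `sq_dist`, and `centroid` are hypothetical.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def centroid(points):
    """Mean vector of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def k_means(data, k, max_passes=100):
    # Seed the clusters with the first k points (an assumption of this sketch),
    # then let every remaining point join the nearest seed.
    labels = list(range(k)) + [None] * (len(data) - k)
    cents = [list(data[i]) for i in range(k)]
    for i in range(k, len(data)):
        labels[i] = min(range(k), key=lambda c: sq_dist(data[i], cents[c]))
    cents = [centroid([x for x, l in zip(data, labels) if l == c])
             for c in range(k)]

    for _ in range(max_passes):
        moved = False
        for i, x in enumerate(data):
            old = labels[i]
            new = min(range(k), key=lambda c: sq_dist(x, cents[c]))
            # Transfer the point only if it does not empty its old cluster.
            if new != old and labels.count(old) > 1:
                labels[i] = new
                # Both affected centroids change with each transfer.
                for c in (old, new):
                    cents[c] = centroid(
                        [y for y, l in zip(data, labels) if l == c])
                moved = True
        if not moved:
            break  # a full pass with no transfers: the partition is stable
    return labels, cents


# Hypothetical two-dimensional data with two obvious clumps.
data = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [5.1, 4.9], [0.9, 1.1], [4.8, 5.2]]
labels, cents = k_means(data, 2)
print(labels)
```

The stable partition reached this way is only a local optimum; a different choice of seeds may lead to a different solution.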
The decomposition of the total scatter into within
and between components suggests possible optimality criteria
to be used in a clustering algorithm. One would like the
within-groups scatter to be small relative to the
between-groups scatter. Various trial clusterings could be
formed using the W and B matrices as a basis for the optimality
criteria which determine the best clustering. A possible
choice for a criterion is to minimize trace W over all
partitions into g groups. Since T is constant over all
partitions, minimizing trace W is equivalent to maximizing
trace B since

    trace T = trace W + trace B

Although trace W is invariant under an orthogonal
transformation, it is not invariant under other non-singular
linear transformations.

McRae [3] points out that trace W equals the total within
group sum of squares, hence the "minimum variance partition"
cluster solution is found by minimizing trace W.

Considerable study has been devoted to alternative
criteria such as those based on multivariate statistical
analysis techniques, especially the methods of linear
discriminant analysis and multivariate analysis of variance.
Assuming the p variables are not linearly dependent, then
as long as $p \le n-g$, W is positive definite symmetric and
so is $W^{-1}$. Attempts to make B and W as different as
possible lead one to solving the determinantal equation:

    $$|B - \lambda W| = 0$$

The solutions $\lambda_i$ are the eigenvalues of the matrix $W^{-1}B$.
There are t non-zero eigenvalues, where t is the minimum
of p and g-1. This is a consequence of the fact that, if
g is less than p, the g group means are contained in a
(g-1)-dimensional hyperplane. When g = 2 the analysis is
equivalent to two-group discriminant analysis. Linear
discriminant analysis would take the vectors originally
described in a p-dimensional coordinate system and
transform the basis to a t-dimensional system. Maximizing the
largest of these eigenvalues is a criterion suggested by
S.N. Roy. Maximizing the trace of $W^{-1}B$, however, is a
criterion known as Hotelling's trace criterion. In both
cases, large values for these statistics are sought in
clustering algorithms since large values indicate large
differences among (between) groups. Minimizing the ratio
of determinants $|W| / |T|$ is a criterion widely known as
Wilks' lambda. Since T is the same for all partitions,
this criterion is equivalent to minimizing det W.



Both trace $W^{-1}B$ and $|T| / |W|$ may be expressed in terms
of the eigenvalues of $W^{-1}B$:

    $$\frac{|T|}{|W|} = \prod_{i=1}^{t} (1 + \lambda_i)$$

    $$\mathrm{trace}\; W^{-1}B = \sum_{i=1}^{t} \lambda_i$$

where t = min(p, g-1). Therefore minimizing det W is
equivalent to maximizing $\prod_{i=1}^{t} (1 + \lambda_i)$.

Friedman and Rubin [4] describe the advantages of the
various criteria. Those based on multivariate statistical
considerations (all but trace W) are invariant under changes
in scale for the variables (non-singular linear transformation).
In fact, they are the only invariants for W and B under such
transformations. Furthermore, the multivariate criteria
may take into account covariation among the variables.


D. DISTANCE CONSIDERATIONS 

As indicated earlier there exist a number of choices
for measuring distance between objects. The choice of
distance function is no less important than the choice of
variables to be used in the study. A serious difficulty
lies in the fact that knowledge of the clusters changes the
choice of distance functions. In the computation of the
distance, a variable which distinguishes well between two
established clusters might be weighted more heavily than
others. Friedman and Rubin describe this difficulty as
the "bootstrap" nature of the problem. Knowledge of the
clusters would suggest an appropriate distance function
which in turn would allow one to determine the original
clusters. The trace W criterion implies ordinary Euclidean
distance and thus hides this circularity. Use of the
criteria which are invariant under non-singular linear
transformations deals effectively with this circularity.

The familiar Euclidean distance is illustrated in
figure 1a. When p = 2 the geometric interpretation of this
measure amounts to determining distances by circles. Two
points such as A and B on the same circle are considered
equidistant from the origin, while other points such as
C and D are farther from the origin than A and B.

A general class of squared distance functions is provided
by utilizing positive definite quadratic forms. Specifically,
if X represents a p-dimensional observation to be assigned
to one of g groups, then to measure the squared distance
between X and the centroid $\bar{X}_i$ of the i-th group one may
consider the function

    $$D_i = (X - \bar{X}_i)' M (X - \bar{X}_i) \qquad (1)$$

where M is a positive definite matrix to ensure that
$D_i \ge 0$. Different metrics are represented by different
choices of the matrix M. When M = I (the identity matrix)
the resulting metric is the standard Euclidean distance.

The variance within the data may make the unweighted
Euclidean metric inappropriate. Referring to figure 1b
where x has a larger variance than y, one may wish to weight
a deviation in the x direction less than an equal deviation
in the y direction. A method for accomplishing this is
through use of an elliptical (weighted Euclidean) distance
function which makes points A and B equidistant from the
origin. The matrix M in this case is diagonal with diagonal
elements equal to the reciprocals of the variances of the
different variables. Insofar as the variance represents the
true structure in the data, this distance function will
adjust for differences due to the scale of measurement of
each of the variables. Extending this idea further, one
may consider the covariance among variables as well. Figure
1c shows how the axes may be tilted so that the major axis
is oriented in a direction reflecting the positive
correlation between x and y. Again, points on the same
ellipse are considered equidistant from the origin. The
matrix M in this case is the inverse of the covariance
matrix.
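The effect of the choice of M can be seen in a small two-dimensional example. The sketch below is an illustration with made-up numbers, not data from the figures: the variances (4 and 1), the covariance (1), the two points, and the helper names `quad_dist` and `inverse_2x2` are all assumptions introduced here.

```python
def quad_dist(x, center, M):
    """Squared distance (x - center)' M (x - center) for a given matrix M."""
    d = [xi - ci for xi, ci in zip(x, center)]
    p = len(d)
    return sum(d[a] * M[a][b] * d[b] for a in range(p) for b in range(p))


def inverse_2x2(m):
    """Inverse of a non-singular 2 by 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


origin = [0.0, 0.0]
point_a = [2.0, 0.0]   # two units along the high-variance x direction
point_b = [0.0, 1.0]   # one unit along the low-variance y direction

identity = [[1.0, 0.0], [0.0, 1.0]]        # M = I: ordinary Euclidean
weighted = [[1.0 / 4.0, 0.0], [0.0, 1.0]]  # M = diag(1/var_x, 1/var_y)
full = inverse_2x2([[4.0, 1.0], [1.0, 1.0]])  # M = inverse covariance

for name, M in [("Euclidean", identity), ("weighted", weighted), ("full", full)]:
    print(name, quad_dist(point_a, origin, M), quad_dist(point_b, origin, M))
```

Under the identity matrix the two points are at different squared distances from the origin, while the diagonal weighting by reciprocal variances makes them equidistant, which is the situation described for points A and B.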

Further examination of this concept is an important
consideration in this research. If $C_i$ represents the
covariance matrix of the i-th cluster then the distance
function

    $$D_i = (X - \bar{X}_i)' C_i^{-1} (X - \bar{X}_i)$$

uses the appropriate covariance structure when determining
distance to a particular cluster centroid. Note that the
number of observations in every cluster must exceed the
dimensionality p in order to preserve the nonsingularity
of $C_i$. Since $C_i$ changes to reflect the dispersion internal
to each particular cluster, the use of this metric exploits
differences in the dispersion characteristics of the
different groups. Figure 1d illustrates the idea. Note how
a new observation (denoted by u) is closer to the centroid
of group one (G1) in terms of Euclidean distance but is
more likely to be assigned to group two (G2) when using the
$C_i$ matrix. It is instructive to point out here that if one
were looking for boundaries dividing the p-dimensional
space into regions, one for each of the g groups, such
boundaries would be non-linear. In the performance of
discriminant analysis, Eisenbeis [5] suggests appropriate
quadratic classification rules.
Another choice for the M matrix in equation 1 is
$C^{-1}$, where C represents the pooled within groups covariance
matrix of all the clusters:

    $$C = \frac{W}{n-g}$$

Recall from Section II.B:

    $$W = \sum_{i=1}^{g} W_i$$

This distance is the well known Mahalanobis distance.
Note that C does not change from group to group. To ensure
the non-singularity of C it must be true that $p \le (n-g)$
where

    $$n = \sum_{i=1}^{g} n_i$$

represents the total number of observations over all
groups.

The use of the Mahalanobis metric in the original
p-dimensional space is equivalent to using Euclidean distance
in the t-dimensional discriminant space with basis vectors
corresponding to the eigenvectors of $W^{-1}B$. Note that the
determination of the discriminant space was based on the
assumption of homogeneity of the cluster covariance
structure. The Mahalanobis distance function therefore adjusts
for both scale of measurement of the variables and
covariation among the variables. Use of this metric is
equivalent to computing distances on variables transformed to
their principal components.

The natural metric to use with the trace W criterion is
the Euclidean distance. However, when using criteria based
on multivariate statistical considerations, Mahalanobis is
the natural metric to use.

When the clusters are distributed as p-variate normal
and have equal covariance matrices, then Fisher's linear
discriminant function is applicable, as is the Mahalanobis
distance. The accuracy of the Mahalanobis metric is
sensitive to the homogeneity of the cluster dispersions and
decreases as the difference between the group dispersions
increases. Recall the density function for the multivariate
normal distribution

    $$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} (\det \Sigma)^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right\}$$

where $\Sigma$ is the covariance matrix and $\mu$ is the mean vector
of the distribution. Note the exponent, which implies that
utilization of Mahalanobis distance is equivalent to
measurement of the density at the point x. The empirical
distributions of the clusters will therefore determine the
cluster to which the observation should be assigned. The
following is a proof of the invariance of Mahalanobis
distance under any non-singular linear transformation.


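Consider the transformation

    $$Y = BX$$

where B is non-singular, and let $D(Y_i, Y_j)$ represent the
Mahalanobis distance between $Y_i$ and $Y_j$. If $C_x$ is the
covariance matrix of the original variables, the covariance
matrix of the transformed variables is $C_y = B C_x B'$, so that

    $$\begin{aligned}
    D(Y_i, Y_j) &= (Y_i - Y_j)' C_y^{-1} (Y_i - Y_j) \\
                &= (BX_i - BX_j)' C_y^{-1} (BX_i - BX_j) \\
                &= (X_i - X_j)' B' C_y^{-1} B (X_i - X_j) \\
                &= (X_i - X_j)' B' (B C_x B')^{-1} B (X_i - X_j) \\
                &= (X_i - X_j)' C_x^{-1} (X_i - X_j) \\
                &= D(X_i, X_j)
    \end{aligned}$$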
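The invariance can also be checked numerically. The following sketch is an illustration only: the five two-dimensional observations and the transformation matrix are made-up values, the sample covariance matrix stands in for the covariance matrix used in the proof, and the helper names are hypothetical.

```python
def covariance(rows):
    """Sample covariance matrix (divisor n - 1) of a list of 2-vectors."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    c = [[0.0, 0.0], [0.0, 0.0]]
    for x in rows:
        d = [x[0] - means[0], x[1] - means[1]]
        for a in range(2):
            for b in range(2):
                c[a][b] += d[a] * d[b] / (n - 1)
    return c


def inverse_2x2(m):
    """Inverse of a non-singular 2 by 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance (x - y)' cov^{-1} (x - y)."""
    inv = inverse_2x2(cov)
    d = [x[0] - y[0], x[1] - y[1]]
    return sum(d[a] * inv[a][b] * d[b] for a in range(2) for b in range(2))


def apply(matrix, x):
    """Matrix-vector product."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [6.0, 4.0]]
transform = [[2.0, 1.0], [0.5, 3.0]]  # determinant 5.5, hence non-singular
Y = [apply(transform, x) for x in X]

d_x = mahalanobis_sq(X[0], X[3], covariance(X))
d_y = mahalanobis_sq(Y[0], Y[3], covariance(Y))
print(d_x, d_y)  # equal up to floating-point rounding
```

By contrast, the ordinary Euclidean distance between the same pair of points changes under this transformation, which is the sense in which the trace W criterion depends on the scale of the variables.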


Some other common metrics are defined below. 
III. THE DATA SET


A. ORIGIN 

The present Student Opinion Form (SOF) system was 
started in the summer quarter of 1975 when it replaced the 
Student Instruction Report (SIR) obtained from the Educa- 
tional Testing Service at Princeton. The SOF form has 16 
questions and space for free-form comments from the students. 
The information obtained from the SOF data is used for the 
twofold purpose mentioned in Section I.A of this paper. 

A SOF form (figure 2) should be completed by each
student for each course segment he takes for credit. The term
"course segment" is used because the same course may be
offered to more than one group of students. To
differentiate between the classes, segment numbers are assigned
and a separate SOF identification number exists for each
segment.
Different segments of the same course may or may not be 
taught by the same professor. About 20 percent of the forms 
are not returned to administration officials due to lack of 
interest on the part of some students and instructors. 
Students have been informed that the results of the SOF 
data are used to assist in identifying faculty members for 
pay raises and tenure considerations. 

Difficulties with legibility of the completed forms and
with the OpScan machine have persisted for several quarters.
The data available for this research have been coded with
indications where invalid responses occur. Only the valid
information was considered in this thesis. Mean scores
were computed for every instructor (every course segment)
from the valid responses in each of the first 13 SOF items.
Only the first 13 questions were used because of the high
percentage of unusable responses in items 14, 15, and 16.
Each of the responses recorded is an integer from one to
five, with five being the upper (more desirable) end of the
scale. These data are therefore considered on an ordinal
scale. Table 1 categorizes the blocks of data which were
available for this study. Note the short 3-digit notation
to be used in this paper, indicating calendar year and
quarter number.


    QUARTER    CALENDAR    NUMBER OF      3-DIGIT
               YEAR        RESPONDENTS    CODE
    Summer     1977        2440           773
    Fall       1977        [illegible]    774
    Winter     1978        8056           781
    Spring     1978        2964           782

                        Table 1


The majority of the analysis was performed using only
quarter 773. Unless otherwise indicated, future references
to the data set shall imply quarter 773 data.
B. TRANSFORMATION

The need for a common covariance structure when using 
the Mahalanobis metric has been emphasized. The transfor- 
mation of quarter 773 data (which attempted to accomplish 
homogeneity of dispersions) is explained in this section. 

The SOF data are 13-dimensional, and the best trans- 
formations would involve separate examination of each of the 
13 variables. Due to the overwhelming complexity of this 
task, only a single transformation was sought. 

In the SOF data the variance is very much a function
of the mean. In fact, a course with a 5.0 mean vector has
no variance whatsoever. Similar effects occur on the lower
end of the scale. A variance-stabilizing transformation
was sought which would help to relieve the dependence of the
variance on the mean. Recall the normal distribution has
independent mean and variance. Other well known
distributions such as the Exponential, Geometric, and Poisson
all have related mean and variance. The assumption of
multivariate normality underlies much of standard classical
multivariate statistical methodology. The effects of departure
from normality are not clearly understood. Although marginal
normality does not imply joint normality, the presence of
many types of non-normality is often reflected in the marginal
distributions as well. The marginal distributions of the
SOF data do not indicate any strong departures from normality.

Previous research by Professor R.R. Read [7] encountered
the same need for a transformation of the SOF data. The
following transformation is due to Professor Read's
findings:

    $$y = \ln\left( \frac{x - 1 + a}{5 - x + a} \right), \qquad \text{where } a = .2 \qquad (1)$$

The transformation was used on SOF item 12, and Bartlett's
test substantiated the presence of homogeneity of variances.
The groups involved here were the course segments, and the
application was univariate.

Studies by Professor Glen Lindsay [8] and students in
his course on Scaling Techniques produced results which
suggested slight modifications to Professor Read's
transformation.

    [equation illegible in source], where a = 2.0          (2)

The same study could be described equally well with a
constant second difference model, or what is the same thing,
the function

    $$X^{2} + C \qquad (3)$$


The three transformations were considered in the following
manner. It was felt that the transformation which would
produce the most nearly homogeneous covariance structure
would be best. The three functions were applied to quarter
773 data, and then statistical tests for common covariance
were administered. The test statistic comes from reference 5
and is explained in Appendix A. The results indicated that
of the three, the first log transformation (1) generated
the most nearly common covariance structure. The group
structure whose covariance matrices were compared came from
clusters formed by the MIKCA algorithm (to be discussed in
the next chapter). On the basis of the test results, the
data were transformed by function (1), and all subsequent
references to the data shall imply the transformed data.
Functions (1) and (2) are shown together on the graph
in figure 3. The one chosen for use is the lower curve.


C. PRINCIPAL COMPONENTS 

Recall the breakdown of the cross products matrix into
the sum of the within and between scatter matrices. When
considering the observations as individual SOFs (and the
groups as courses), the cross products matrix will be called
the master scatter matrix with decomposition:

$$M = S + T$$

where S is the within course scatter and T is the between
course scatter. It is reemphasized that, in this equation,
the groups are the courses. The breakdown of the master
scatter matrix may be examined before any clustering of
course means is performed because the group structure
(courses are groups) is known. The discussion is enhanced
by an algebraic description of the matrices involved. Let

$$M = \sum_{i=1}^{N} \sum_{j=1}^{ns_i} (X_{ij} - X_{..})(X_{ij} - X_{..})'$$

$$S = \sum_{i=1}^{N} \sum_{j=1}^{ns_i} (X_{ij} - X_{i.})(X_{ij} - X_{i.})'$$

$$T = \sum_{i=1}^{N} ns_i (X_{i.} - X_{..})(X_{i.} - X_{..})'$$

where
    $X_{ij}$ is the j-th SOF response form from the i-th
             course.
    $X_{i.}$ is the mean vector of the i-th course.
    $X_{..}$ is the grand mean.
    $ns_i$   is the number of students in the i-th course.
    $N$      is the total number of courses.


Since T represents the dispersion of the course means, it
is the main object of the clustering efforts. It is natural
to ask also, how much information is in S. To this end a
principal components analysis was performed on the covariance
matrices:

$$C_T = \frac{T}{N - 1}, \qquad N - 1 = 189$$

$$C_S = \frac{S}{\sum_{i=1}^{N} (ns_i - 1)}, \qquad \sum_{i=1}^{N} (ns_i - 1) = 1993 \text{ for the quarter 773 data}$$



Anderson [6] describes principal components as the axes of 
a coordinate system with special statistical properties. 
The principal components form a new coordinate system 
resulting from linear transformations of the variables 
which produce the special properties in terms of variances. 
The idea is to describe the data swarm by a new set of 
orthogonal coordinates so that the sample variances with 
respect to the new coordinates are in decreasing order. 

If the eigenvalues of the covariance matrix are ordered,
where $\lambda_1 > \lambda_2 > \dots > \lambda_p$, then the variance in the new
coordinate system is greatest in the dimension associated
with $\lambda_1$, next greatest in the dimension associated with $\lambda_2$,
etc. The sum of the eigenvalues is the total variance in
the original coordinate system.

The results of the principal components analysis are
shown in Table 2. First, it is of interest to compute how
much of the total energy in M is accounted for by T.

$$\mathrm{tr}\, M = 18.8\,(1993) + 156\,(189) = 66952$$

PRINCIPAL COMPONENTS ANALYSIS

        C_T           C_T FIRST       C_S           C_S FIRST
    EIGENVALUES      EIGENVECTOR   EIGENVALUES     EIGENVECTOR
i 0.44 O26 O52 =O s27 
2 eS. 9 7 0.30 0.47 = 18 8) 
3 4.64 0.28 0.49 Oma 
+ 0.46 0.20 Oeeop) Oa 2 9 
2 B06 0.23 OZ =-Q.19 
6 Ae | Ole Jee) 0.63 OP ILE, 
7 4 Ors Oreos eUe24 
8 Js 156) ONrZe7 O26¢6 =O 29 
9 1.47 Oms2 G77 Om 3 
LO 0.66 02 350 OS, One 
ik J O20 Jes 0/0, =O 
2 0. 96 Oe inet 0 Te) 
eS 0.84 Ore26 LOO 0 AG) 
TOTAL 156 18.8 
TABLE 2 


T accounts for 29484 / 66952 = 44 percent of the total.
This indicates that a great deal of variability must there-
fore be accounted for within the courses (i.e., with the
students).

The principal components analysis of $C_S$ shows the first
principal component accounts for 55 percent of its total
variance, but all other coordinate directions each account
for 6 percent or less. Moreover, the direction of the first
component is essentially the main diagonal of 13 space,
i.e., the signs are all the same and so are the magnitudes
(approximately). Thus the data swarm may be thought of
as an elongated ellipsoid directed along the main diagonal
and having spheroidal (more or less) cross section. In
particular, this suggests that the students within a course
tend to score all 13 components more or less the same (all
high, all moderate, or all low), but perceptions from student
to student differ.

Turning to the principal components analysis of $C_T$, it
is seen that 85 percent of the total variability is accounted
for by the first principal component, and the second accounts
for only three percent. Thus the data swarm of course means
may be viewed as essentially one dimensional. Reference to
its eigenvector reveals no single SOF item or group of SOF
items is heavily weighted relative to the others and that the
signs are again all the same. Thus, this component is
similarly shaped along the main diagonal of 13 space, but
more extremely elongated.
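
The share of variance carried by a first principal component, and the "main diagonal" direction described above, can be illustrated with a small sketch. The 3-variable covariance matrix below is hypothetical (it is not the SOF covariance), and the power iteration is only one convenient way of obtaining the largest eigenvalue; it is not the routine used in this study.

```python
import math

def power_iteration(cov, iters=200):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix."""
    p = len(cov)
    v = [1.0, 0.5, 0.25][:p]          # arbitrary starting vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the eigenvalue for a unit vector v
    lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    return lam, v

# Hypothetical covariance: equal variances and strong equal covariances,
# i.e. respondents tend to score every item high or every item low.
cov = [[1.0, 0.8, 0.8],
       [0.8, 1.0, 0.8],
       [0.8, 0.8, 1.0]]

lam1, vec1 = power_iteration(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))  # sum of eigenvalues
share = lam1 / total_variance

print("first eigenvalue:", round(lam1, 4))
print("share of total variance:", round(share, 4))
print("first eigenvector:", [round(x, 4) for x in vec1])
```

In this toy case the eigenvector components share one sign and one magnitude, which is the pattern the text describes as lying along the main diagonal.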

Some exploratory work was done on the within class
variability (S) to see if the "number of quarters completed"
by students has any effect on the variability represented
by S. Figure four presents the results with a graph plotting
within course variance versus time on board. Note the
tendency for the variability to drop off in later quarters,
possibly indicating more perfunctory completion of the forms.

IV. THE MIKCA METHOD


A. THE ALGORITHM 

The specific algorithm chosen for the cluster analysis 
is the MIKCA (Multivariate Iterative K-MEANS Clustering 
Algorithm) program written by Douglas J. McRae as a part 
of his doctoral dissertation at the University of North 
Carolina, Chapel Hill. 

Reference to the flow chart in figure 5 will aid the
reader in the following discussion of the algorithm. Inputs
to the program are the data matrix, an estimate for g (the
number of clusters), and choice of criterion and distance
functions.

In the first step, preliminary calculations are made,
such as the variable means and standard deviations, as well
as the cross products matrix T. The next step forms the
initial cluster solution. A random choice of g observations
serves as the initial cluster centers. Then each of the
other observations is assigned to the nearest cluster.
Euclidean distance is used for this initial phase, and the
cluster centroids are recomputed after each observation is
assigned to a group. The observations are considered in the
same order as they were input. After all of them have been
assigned to clusters, the criterion value is computed. This
initial cluster-finding technique is referred to as a one-pass
K-MEANS procedure. It is performed three times, and
the solution which yields the best criterion value is chosen
as the initial cluster solution.

FIGURE 5

MIKCA FLOW CHART

After the initial solution has been found, the program
advances to the iterative K-MEANS phase where the observa-
tions are again considered in the order in which they were
input to the program. It is in this phase that the user's
choice of distance function is used. The distance from each
observation to each cluster centroid is again computed, this
time with the user's distance function, the assignment to the
closest centroid being made and the centroid updated to
reflect its new membership. After considering all n obser-
vations in this manner, the new criterion value is checked
for possible improvement during the K-MEANS iteration. As
long as the criterion value improves, the K-MEANS procedure
is repeated; if the criterion fails to improve then the MIKCA
algorithm goes to the next step, the individual switches
section.

Note the importance of the order of consideration of the
observations. The order is important because the cluster
means are recomputed after each observation is reassigned.

In the individual switches phase, consideration is given
to moving each observation to every other cluster, the move
being made if and only if an improvement in the value of the
criterion results. An elaborate labelling procedure pro-
vides a unique order in which to consider each observation.
This procedure continues until a complete pass through the
data is made with no changes in cluster membership.
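
The three phases just described can be sketched in a few lines. This is only an illustration under stated assumptions, not McRae's FORTRAN program: it fixes the criterion to minimum trace W and the distance to Euclidean, it considers observations in plain input order rather than by McRae's labelling procedure, and the function names and the eight data points are hypothetical.

```python
import random

def centroid(points):
    return [sum(x[d] for x in points) / len(points) for d in range(len(points[0]))]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def trace_w(data, labels, g):
    """Criterion 1: trace of the pooled within-cluster scatter matrix W."""
    total = 0.0
    for s in range(g):
        members = [x for x, lab in zip(data, labels) if lab == s]
        if members:
            c = centroid(members)
            total += sum(sqdist(x, c) for x in members)
    return total

def one_pass(data, g, rng):
    """g random seeds; remaining observations assigned in input order,
    with the receiving centroid recomputed after every assignment."""
    seeds = rng.sample(range(len(data)), g)
    labels = [None] * len(data)
    groups = [[data[i]] for i in seeds]
    for s, i in enumerate(seeds):
        labels[i] = s
    for i, x in enumerate(data):
        if labels[i] is None:
            s = min(range(g), key=lambda k: sqdist(x, centroid(groups[k])))
            labels[i] = s
            groups[s].append(x)
    return labels

def kmeans_pass(data, labels, g):
    """One K-MEANS pass; centroids reflect each reassignment immediately."""
    labels = labels[:]
    for i, x in enumerate(data):
        cents = {}
        for s in range(g):
            members = [y for y, lab in zip(data, labels) if lab == s]
            if members:
                cents[s] = centroid(members)
        best = min(cents, key=lambda s: sqdist(x, cents[s]))
        if best != labels[i] and labels.count(labels[i]) > 1:  # keep clusters non-empty
            labels[i] = best
    return labels

def individual_switches(data, labels, g):
    """Move one observation at a time, only if the criterion improves;
    stop after a complete pass with no change in membership."""
    labels = labels[:]
    changed = True
    while changed:
        changed = False
        for i in range(len(data)):
            if labels.count(labels[i]) == 1:
                continue
            current = trace_w(data, labels, g)
            for s in range(g):
                if s == labels[i]:
                    continue
                trial = labels[:]
                trial[i] = s
                if trace_w(data, trial, g) < current - 1e-12:
                    labels, changed = trial, True
                    break
    return labels

def mikca_like(data, g, seed=0):
    rng = random.Random(seed)
    # best of three one-pass solutions becomes the initial solution
    labels = min((one_pass(data, g, rng) for _ in range(3)),
                 key=lambda lab: trace_w(data, lab, g))
    best = trace_w(data, labels, g)
    while True:  # repeat K-MEANS passes while the criterion improves
        trial = kmeans_pass(data, labels, g)
        value = trace_w(data, trial, g)
        if value < best - 1e-12:
            labels, best = trial, value
        else:
            break
    labels = individual_switches(data, labels, g)
    return labels, trace_w(data, labels, g)

# Hypothetical data: two well separated groups of four points each.
courses = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
           [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]]
labels, value = mikca_like(courses, g=2)
print(labels, value)
```

Because centroids are recomputed after every reassignment, shuffling `courses` can change the intermediate solutions, which is the order dependence noted above.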
The MIKCA algorithm provides the following options for
distance and criterion functions.

CRITERION
    1. Minimum trace W
    2. Minimum det W
    3. Maximum largest root of |B - λW| = 0
    4. Maximum sum of roots of |B - λW| = 0

DISTANCE
    1. Euclidean
    2. Weighted Euclidean
    3. Mahalanobis


Using R.A. Fisher's iris data, McRae tested his algorithm
and produced extremely good results. Using the det W
criterion and Mahalanobis distance, MIKCA produced a solu-
tion identical to the classification given by multiple
discriminant analysis. This is a notable achievement since
the cluster procedure, which does not know the true composi-
tion before the analysis, makes the same final classification
of observations as does the discriminant procedure, which
bases its analysis on the group composition information.

The MIKCA provides as output the value of the criterion
function, the cluster membership, and the cluster mean vectors.
Also provided are two matrices, T and W. The program was
written in FORTRAN IV for the IBM 360 series of computers.

B. MODIFIED MIKCA

Initially, the MIKCA program was used with the p-component 
mean responses for each course as the input data matrix. 
Since the number of students utilized in producing these 
means is quite variable, these input vectors are not equally 
well determined and, as has been mentioned earlier, may
affect the covariance structure between the objects. It
is desirable to have the option of weighting these course 
means in order to effect a better balance in terms of their 
accuracy and to reduce any consequential distortion in the 
covariances. It is convenient to refer to this modification 
as the "1 man 1 vote" option, and to the original technique 
as the "1 course 1 vote" option. The following algebraic 
definitions will aid in illustrating the weighting effect. 

Recall the breakdown of the master scatter matrix into
the sum of within and between matrices.

$$M = S + T$$

When the mean scores are computed for each course and used
as inputs to MIKCA, then a different dispersion structure
takes form. The groups are no longer the known courses,
but are now the object of the problem. The groups are unknown
clusters of courses (or professors). Let T* denote the
total scatter contained in the data when each observation
represents a course mean vector. T* may also be expressed
as the sum of within and between scatter matrices.

$$T^* = \sum_{s=1}^{g} \sum_{k=1}^{nc_s} (X_{sk} - X_{..})(X_{sk} - X_{..})'$$

$$W^* = \sum_{s=1}^{g} \sum_{k=1}^{nc_s} (X_{sk} - X_{s.})(X_{sk} - X_{s.})'$$

$$B^* = \sum_{s=1}^{g} nc_s (X_{s.} - X_{..})(X_{s.} - X_{..})'$$

where
    $nc_s$   is the number of observations (courses) in
             the s-th cluster.
    $g$      is the number of clusters.
    $X_{sk}$ is the k-th observation (course mean vector)
             in the s-th cluster.
    $X_{s.}$ is the mean vector of the s-th cluster,
             $X_{s.} = \frac{1}{nc_s} \sum_{k=1}^{nc_s} X_{sk}$.
    $X_{..}$ is the grand mean.

Note that the grand mean mentioned here is not the same as
the grand mean used in the decomposition of the master
matrix M. The difference between T and T* is that T is
weighted by the number of students in each course, $ns_i$. This
weighting factor was lost when the individual observations
were viewed as the class mean vectors. A close algebraic
examination of T will illustrate its weighted property.

Originally, we had M = S + T where:

$$T = \sum_{i=1}^{N} ns_i (X_{i.} - X_{..})(X_{i.} - X_{..})'$$

Let $X_{i.}$ become $X_{sk}$ (k-th course mean in s-th cluster) and
$ns_i$ become $ns_{sk}$ (number of students in k-th course of s-th
cluster). Therefore the same T can be reexpressed as

$$T = \sum_{s=1}^{g} \sum_{k=1}^{nc_s} ns_{sk} (X_{sk} - X_{..})(X_{sk} - X_{..})'$$

Letting

$$W_s = \sum_{k=1}^{nc_s} ns_{sk} (X_{sk} - \bar{X}_s)(X_{sk} - \bar{X}_s)'$$

where

$$\bar{X}_s = \frac{\sum_{k=1}^{nc_s} ns_{sk} X_{sk}}{\sum_{k=1}^{nc_s} ns_{sk}} \quad \text{(weighted mean vector of s-th cluster)}$$

then

$$W = \sum_{s=1}^{g} W_s$$

and

$$T = W + B \quad \text{(B is obtained by subtraction)}$$



The understanding of this distinction is important because 
it describes the abbreviated (unweighted) dispersion upon 
which MIKCA bases its cluster solution. 

A number of changes were made to the MIKCA computer code
to allow for a system of weights, $ns_i$, for the course means.
The modified code extends the capability of MIKCA by making
this option available. It amounts to using T rather than
T* as the basic dispersion structure. This seems more natural
because the matrix T appears in the earlier decomposition.
Some of the changes are summarized here:

    1. Allow for class size as input.
    2. Alter the computation of T to allow for the weighting
       factor.
    3. Alter the computation of cluster centroids to allow
       for weighting.
    4. Alter calculations of the B matrix for the same
       reason. (W is found by subtracting B from T.)

The computer code for the modified MIKCA is included in
Appendix F.
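
The effect of the weights can be checked numerically with a short sketch. It is not the Appendix F code: the four course mean vectors, the class sizes, and the cluster assignment are hypothetical, and the helper names are chosen for illustration only. The sketch builds the weighted T, W, and B from their definitions and confirms that T = W + B, then forms the unweighted T* for comparison.

```python
def outer(u, v):
    return [[a * b for b in v] for a in u]

def add(A, B, scale=1.0):
    """Elementwise A + scale * B for equally sized matrices."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def weighted_mean(vectors, weights):
    total = sum(weights)
    p = len(vectors[0])
    return [sum(w * x[d] for x, w in zip(vectors, weights)) / total
            for d in range(p)]

def scatter(vectors, weights, center):
    """Sum of w * (x - center)(x - center)' over the given vectors."""
    p = len(center)
    S = [[0.0] * p for _ in range(p)]
    for x, w in zip(vectors, weights):
        dev = [a - c for a, c in zip(x, center)]
        S = add(S, outer(dev, dev), w)
    return S

# Hypothetical course means on two SOF items, with class sizes ns.
means = [[4.5, 4.2], [4.4, 4.4], [3.1, 3.0], [2.9, 3.3]]
sizes = [3, 5, 30, 25]
clusters = [[0, 1], [2, 3]]          # assumed cluster membership

grand = weighted_mean(means, sizes)  # student-weighted grand mean
T = scatter(means, sizes, grand)     # "1 man 1 vote" total scatter

W = [[0.0, 0.0], [0.0, 0.0]]
B = [[0.0, 0.0], [0.0, 0.0]]
for members in clusters:
    xs = [means[k] for k in members]
    ws = [sizes[k] for k in members]
    cbar = weighted_mean(xs, ws)                       # weighted cluster centroid
    W = add(W, scatter(xs, ws, cbar))                  # within-cluster part
    B = add(B, scatter([cbar], [sum(ws)], grand))      # between-cluster part

# "1 course 1 vote" total scatter T*: every course mean has weight one.
ones = [1] * len(means)
T_star = scatter(means, ones, weighted_mean(means, ones))

print("T      =", T)
print("W + B  =", add(W, B))
print("T*     =", T_star)
```

With these numbers the two large classes dominate T, while T* treats all four courses alike.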

Cluster solutions using both weighted (T) and unweighted
(T*) dispersion structures were found and compared (see
table 3 in next section). The comparison indicates some
differences in cluster solutions; however, the importance
of these differences is left to the reader.

C. RESULTS

Several cluster solutions were formed using the MIKCA
algorithm. It seemed wise to include the number of students
in a course as the 14-th variable. The natural logarithm
of the class size was the transformation applied to this
variable. Since class sizes ranged from two to 40, this
transformation brought the values into a range similar to
that of the other 13 variables and also reduced skewness.
For quarter 773 the mean class size was 12.7 students with a
standard deviation of 7.9. For the transformed variable these
statistics are 2.3 and 0.7. Cluster solutions were found
with and without inclusion of this 14-th variable. The
results are shown in table three.

Another option available to the MIKCA user is the stan-
dardization of variables prior to entering the clustering
process. McRae points out how this option becomes very
useful when the variables are on vastly different scales of
measurement. Except for the 14-th variable the present
scales are psychological in nature and seem to be much the
same. Some exploratory work was performed with the standar-
dization option (see table three) but it was not considered
significant because of the similarity in the scales of
measurement.

Table three shows the comparisons of cluster results 
obtained under various conditions. The comparison coeffi- 
cient provides a measure of agreement between solutions and 
is computed by a method introduced in Chapter VII. Table 
three shows generally higher values for g = 3, indicating 
that there exists robustness of solutions for the smaller 
values of g. 

The results of these cluster solutions may also be seen
in graphical form by referring to Appendix B. These graphs,
called profile charts, depict the mean vectors for each of
the clusters formed by the MIKCA algorithm. The mean vectors
have been standardized so that one can see the number of
standard deviations from the grand mean. These profiles are
also helpful in identifying the variables which are
significant in the cluster membership. For example, an important
variable would be one that produces a break in the pattern.

In the 13 variable case, the profiles produced results
which indicated the lack of clearly dominant variables in
cluster identification. With introduction of the 14-th
variable, some very revealing results become immediately
apparent. While the cluster membership changed little in
going from 13 to 14 variables, the cluster with the highest
mean vector became clearly associated with the smallest
class sizes. Similarly the cluster with the lowest mean
vector is characterized by a very large class size. This
finding is one of the most significant results.

One of the most critical decisions facing the analyst
is the number of clusters to form. Some algorithms based
on the K-MEANS idea allow g to change during the clustering
process; however, the MIKCA method requires g to be input by
the user and it does not change in the course of the pro-
gram execution. Typically the investigator does not know
the number of clusters in the data, and he must make some
educated guess. As pointed out earlier, it is possible for
several different, but meaningful, cluster solutions to
exist in one body of data.

The method used to determine g was to obtain solutions
based upon several values for g and then plot the criterion
values for each of these solutions. An appropriate choice
for g would be a number beyond which the marginal improvement
of the criterion becomes insignificant. Figures 6 and 7
are the results of such tests, suggesting that six clusters
represent the major portion of the separating power of the
algorithm.
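
The choice was made here by inspecting the plots. A numerical stand-in for that inspection might look like the sketch below, in which the criterion values, the 10 percent tolerance, and the function name `choose_g` are all hypothetical and a criterion that is being minimized (such as trace W) is assumed.

```python
def choose_g(criterion_by_g, tolerance=0.10):
    """Return the smallest g for which one more cluster improves the
    criterion by less than `tolerance`, as a fraction of its current value."""
    gs = sorted(criterion_by_g)
    for g, g_next in zip(gs, gs[1:]):
        gain = (criterion_by_g[g] - criterion_by_g[g_next]) / criterion_by_g[g]
        if gain < tolerance:
            return g
    return gs[-1]

# Hypothetical criterion values (smaller is better) for g = 2, ..., 8.
values = {2: 100.0, 3: 70.0, 4: 52.0, 5: 41.0, 6: 34.0, 7: 32.5, 8: 31.5}

chosen = choose_g(values)
print("chosen g:", chosen)
```

The tolerance plays the role of the analyst's judgment about when the marginal improvement has become insignificant, so a different tolerance can give a different g.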

Profile charts of the cluster solutions with g = 6 were
uninteresting. The middle clusters were all bunched
together, suggesting that clusters were forced on that part
of the data where perhaps they did not actually exist (i.e.,
sparse data near the boundaries). Comparison results (table
3) indicate a much more stable solution when g is reduced
below six.

V. DISCRIMINANT ANALYSIS


A. THEORY 

As mentioned earlier, discriminant analysis allows an
analyst to classify new observations based on observations
which are samples from known groups. Only Fisher's linear
discriminant function will be used in this study. It also
provides information about the relative importance of the
various variables in assigning an observation to a group.
The linear discriminant function is based on the assumptions
of multivariate normality and homogeneity of dispersions.
The ability to identify the dominant variables and the
dimension reduction offered by the discriminant space were
both extremely useful aids for analyzing the SOF data.

These "more important" variables will be earmarked for
later use in the construction of Chernoff's FACES. Also
of interest is the plot of data points in discriminant
space. The interaction of the coefficients in the dis-
criminant functions will be seen as well as the character-
ization of the dimensions.

In order to describe our usage of discriminant analysis,
let us first suppose there are only two clusters in 13-
dimensional space. It is desired to project these two
clusters orthogonally onto a line so that the variation
between the two groups is as large as possible relative to
the variation within the two groups. Finding the direction
of projection to accomplish this is part of the purpose of
discriminant analysis. The solution provides a way of
discriminating between the two clusters by a suitable linear
combination of the 13 variables. The same theory is gen-
eralized to g groups, where Wilks [9] has shown that a
projection to the smaller of g-1 or p dimensions is possible
without loss of information. Recall the earlier discussion
that indicated this smaller number as t, the number of non-
zero eigenvalues of $W^{-1}B$. The eigenvalues are the variances
in the direction of their associated eigenvectors. One can
easily determine the proportion of variance attributable
to each of the dimensions of discriminant space and also
the SOF items which load most heavily in each dimension.
One gains insight into the variables from examination of
the coefficients in the discriminant functions. There is
one function for each dimension, the standardized coeffi-
cients of which are used in this analysis.
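
For the two-cluster case described above, the direction of projection has the well known closed form w = W^{-1}(m1 - m2), where m1 and m2 are the group means and W is the pooled within-group scatter. The sketch below illustrates this in two dimensions; the two groups of points are hypothetical and the code is not the discriminant program used for the SOF data.

```python
def mean(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(2)]

def within_scatter(groups):
    """Pooled within-group scatter matrix W (2 x 2)."""
    W = [[0.0, 0.0], [0.0, 0.0]]
    for pts in groups:
        m = mean(pts)
        for p in pts:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    W[i][j] += d[i] * d[j]
    return W

def fisher_direction(group1, group2):
    """Two-group Fisher direction w = W^{-1} (m1 - m2) in two dimensions."""
    W = within_scatter([group1, group2])
    m1, m2 = mean(group1), mean(group2)
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    inv = [[W[1][1] / det, -W[0][1] / det],
           [-W[1][0] / det, W[0][0] / det]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(points, w):
    return [w[0] * p[0] + w[1] * p[1] for p in points]

# Hypothetical clusters: the second is the first shifted along the first axis.
g1 = [[4.0, 2.0], [5.0, 3.0], [6.0, 2.5], [5.0, 1.5]]
g2 = [[1.0, 2.0], [2.0, 3.0], [3.0, 2.5], [2.0, 1.5]]

w = fisher_direction(g1, g2)
scores1 = project(g1, w)
scores2 = project(g2, w)
print("direction:", w)
print("group 1 scores:", scores1)
print("group 2 scores:", scores2)
```

The relative sizes and signs of the components of `w` are what the text examines, after standardization, to judge which variables matter most.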


B. RESULTS

Up to this point, most of the analysis has been performed 
on the 190 courses in quarter 773. A smaller, more manage- 
able data base was needed to continue. Also, it seemed 
wise to prepare to study individual departments. The 
Electrical Engineering Department was chosen for further 
analysis since it is a large department and hence not too 
small for this purpose. Over the four quarter period, there 
were 116 course segments with valid SOF responses. These 116 
courses from the EE department were the data used in the
discriminant analysis.

When dealing with four clusters, the dimensionality of
the discriminant space is three, and depending on the size
of the eigenvalues, perhaps fewer dimensions will provide
sufficient discrimination. Table four gives the results of
performing a discriminant analysis. Figure eight is a graph
of the two-dimensional discriminant space (the third
dimension is neglected).

The eigenvalues indicate 94 percent of the total variance 
is represented by the first two discriminant functions. 
Figure eight corroborates this fact by depicting easily 
seen separation in two dimensions. Imagine projecting the 
points onto the horizontal axis. Discrimination in the 
first dimension would account for 73.6 percent of the 
variation. Groups one and four would easily be separated, 
but two and three would overlap. 

Examination of the coefficients will enable one to label
the dimensions by identifying the dominant characteristics
which they measure. The first dimension is along the hori-
zontal axis and is associated with the first discriminant
function. The magnitude of the coefficients indicates their
relative impact on the dimension. The signs aid in under-
standing which variables reinforce one another (matching
signs) and which tend to cancel (opposite signs). In the
first function of table four, SOF item 12 is the most promin-
ent. This question (see figure 2) asks the student to score
the overall rating of the instructor. It is not surprising





RESULTS OF DISCRIMINANT ANALYSIS ON THE 116 COURSES
IN EE DEPARTMENT OVER A FOUR QUARTER PERIOD

DISCRIMINANT      EIGENVALUE      RELATIVE
FUNCTION                          PERCENTAGE
    1                5.79           73.6
    2                1.64           20.8
    3                0.44            5.6

STANDARDIZED DISCRIMINANT
FUNCTION COEFFICIENTS

            Function 1    Function 2    Function 3
om 1 0 eal 0223 -0.22 
mile 0.14 S06 OS) 
05 1.46 Oo 
47 -0.36 -0.80 
evil Oie07 -0.82 
03 Gm -0.74 
215 -0.36 Oaei2 
me pak. -0.36 
“30 -0.82 -0.47 
S08 0.08 -0.77 
0 -1.16 ile oe: 
Pay -1.08 130 
738 es O92 
TABLE 4 
that this question is very important in the discrimination
process. High marks on items four and five tend to
reinforce a high mark on question 12. Those questions are:

    (4) Difficult concepts were made understandable.

    (5) I had confidence in the instructor's knowledge
        of the subject.

Interestingly, a high mark on item nine (instructor made 
the course a worthwhile learning experience) tends to 
diminish the effect of a high score on item 12. The first 
dimension is dominated by question 12 and was labeled the 
"popularity" dimension. 

The second dimension is depicted by the vertical axis
on the graph in figure eight, and measurements along this
dimension are controlled by the second discriminant function.
The separating power in this direction is less than one
third that of the first. Note however that the vertical
scale is compressed 25 percent more than the horizontal
scale (1.5 inches vertical = 2.0 inches horizontal). Items
three and eight have strong positive coefficients whereas
questions 11 and 12 are pulling heavily in the negative
direction. However the strength of the information is not
great, and deeper interpretation hardly seems worth the
effort.

Only 5.6 percent of the total variance appears in the
third function, and it is therefore considered insignificant.
One might note that item 12 also dominates the third
dimension.

The main purpose here has been to identify variables
for use in constructing Chernoff FACES. The discriminant
analysis has served that purpose well, and it has also
described the character of the dimensions.

VI. CHERNOFF FACES


A. BACKGROUND 

Chernoff's FACES was the second cluster method to be 
applied to the SOF data. The method was used with the 
same purpose in mind, and it was hoped that earlier cluster 
solutions could be reproduced by this method. Additionally, 
there was the possibility of gaining new information about 
the structure within the data. Professor Herman Chernoff 
developed this graphical method for representing multivariate 
data. The now familiar data point in p-space is represented 
by a computer-drawn cartoon of a face whose characteristics 
(features) are determined by the position of the point. 
Features such as nose length and mouth curvature correspond 
to components of the data point. In the case of the SOF 
data, each component of the 13-dimensional vector can be 
made to control one of 20 features, and seven constants can
be selected for the remaining features. The technique lends 
itself to clustering since the investigator can group 
together those faces which resemble each other. 

Chernoff [10] points out that people spend a great deal
of their life studying and reacting to faces. The human
mind subconsciously acts as a high speed computer, sometimes
detecting barely measurable differences and ignoring unim-
portant differences, even if they are large. Chernoff
claims that unlike a machine, the mind has the capability
to disregard non-informative data and search for useful
information. He states that certain major characteristics
of the faces are instantly observed and easily remembered;
finer details and correlations become apparent after studying
the faces. Clustering by sorting the faces is certainly
easier than staring at a large matrix of data. The method
has pitfalls and limitations and some of them will be dealt
with in this thesis.

After the publication of Chernoff's method [11], quite
a number of people began experimenting with the technique.
Lake [12] mentions a few more successful applications of
Chernoff's method, including:

    1. L.A. Bruckner of Los Alamos Scientific Lab of the
       University of California while studying the performance
       of offshore oil companies.
    2. Johns Hopkins University
       a. Developing methods of psychiatric screening.
       b. Monitoring patients in intensive care units.
       c. Monitoring the stock market.
    3. Dr. David L. Huff of the University of Texas in
       developing urban regional indicators that measure the
       quality of life.
    4. Professor P.C.C. Wang and Gerald Lake at the Naval
       Postgraduate School in analyzing Soviet naval penetra-
       tions into the Indian Ocean and the African littoral;
       and Soviet foreign policy in sub-Saharan African
       states.
    5. Professor Chernoff in geological and fossil-
       related experiments.

The field of computer graphics has experienced tremendous 
growth in recent years due mainly to the state of the art 
in computers and computer display equipment (including both 
video and plotting types). The adage that "a picture is 
worth a thousand words" has proven to be quite true. Recent 
developments include on-line programs that perform statis- 
tical analysis with polygon, bar graphs, arrows, and scatter 
diagrams. Three-dimensional data displays have facilitated 
the work of engineers and statisticians alike. 

An interesting application of the FACES program is 
Bruckner's study of offshore drilling by oil companies. 
Figure 9 displays some of his results. Two of the features, 
nose width and eye separation, are controlled by the varia- 
bles "expected years to production" and "number of leases 
won", respectively. Other features are controlled by a 
variety of variables representing the company's financial 
health and growth potential. 

Reference to figure 10 will help describe how the faces 
are constructed. Table 5 gives the range of the variables 
which control the features and distance parameters of the 
face. 

The data are first converted to the X parameters as
follows. The variable Z is used to control the parameter
$X_i$ which is allowed to range from $a_i$ to $b_i$.

Figure 9. Bruckner's Offshore Hydrocarbon Producers

Figure 10. Chernoff Face with Ears

FEATURE RANGES AND DESCRIPTION

This table is taken from reference 10, and the descriptions
are not complete. For a more detailed, mathematical
explanation, see Appendix C.

     1. distance from O to P
     2. angle between OP and horizontal
     3. half-height of face
     4. eccentricity of upper ellipse of face (width/height)
     5. eccentricity of lower ellipse of face (width/height)
     6. length of nose
     7. position of center of mouth
     8. curvature of mouth
     9. length of mouth
    10. height of centers of eyes
    11. separation of centers of eyes
    12. slant of eyes
    13. eccentricity of eyes (height/width)
    14. half-length of eye (also depends in part on other
        parameters)
    15. position of pupils
    16. height of eyebrow center relative to eye
    17. angle of brow relative to eye
    18. length of brow
    19. diameter of ear
    20. nose width

TABLE 5

where m and M are the observed minimum and maximum of Z.
Chernoff's technical report [10] presents a very detailed
description of the geometric relationship of the features
in the face construction. A few general remarks concerning
the geometric attributes are included here. The boundary of
the face is formed by joining portions of two ellipses, an
upper and a lower. The angle theta (θ) determines where
the ellipses meet and consequently, the height of the ears.
The nose is a triangle centered at the origin. Both its
height and width are variable. The curvature of the mouth
is a portion of a circle, the radius and center of which are
also variable. The eyes are formed by ellipses whose angle,
half-length, and eccentricity are all controlled by variables.
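
The conversion of a data variable Z into a feature parameter can be illustrated with a small sketch. It assumes that the conversion is a linear rescaling of Z from its observed range [m, M] onto the allowed feature range [a, b], which is consistent with the description above but is stated here as an assumption. The function name, the scores, and the feature range are hypothetical.

```python
def to_feature(z, z_min, z_max, a, b):
    """Linearly rescale z from [z_min, z_max] onto the feature range [a, b].

    Assumed form: X = a + (b - a) * (z - z_min) / (z_max - z_min).
    """
    if z_max == z_min:
        # A constant variable carries no information; use the midpoint.
        return (a + b) / 2.0
    return a + (b - a) * (z - z_min) / (z_max - z_min)

# Hypothetical course means on one SOF item, driving a feature
# that is allowed to range from 0.2 to 0.8.
scores = [2.0, 3.5, 5.0]
m, M = min(scores), max(scores)
features = [to_feature(z, m, M, 0.2, 0.8) for z in scores]
print(features)
```

Under this assumption the lowest observed score always maps to a and the highest to b, so each face feature uses its full allowed range.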


B. FEATURE-VARIABLE RELATIONSHIP 

A frequent question is whether some features are more 
informative than others. Some observers feel that the eyes 
convey the most information; others regard the mouth or the 
shape of the face as the most relevant feature. The results
of the discriminant analysis identified the most dominant 
variables in the discriminant space. Now these variables 
must be assigned to facial features. 

Chernoff [13] himself conducted an experiment to evaluate
the effect on classification error of random permutations
in the assignment of variables to features. He found that
random permutations would change the faces so that a
classifier might increase or decrease his number of errors
by a factor of about 25 percent. Unfortunately, his experi-
ment did not evaluate the efficiency of specific features.
His studies also make no effort to determine whether ability
to discriminate depends on the dimensionality of the data.

Considering Chernoff's findings, it would seem that
the assignment of variables to features is of minor
importance. The use of discriminant analysis provides a
way of detecting which variables are important, and it
seems appropriate to take advantage of this valuable infor-
mation when constructing the faces. Moreover, there is
choice in the features that are selected for use. The
author's choices of the six best features are starred in
table 6. The table gives the complete list of feature-
variable combinations. The results of the discriminant
analysis were relied upon heavily in forming the variable
assignments.

Reference to figure eight (discriminant space) and
table 6 will aid in the following discussion. In the first
dimension the important SOF items are 12 and 4 which con-
trol the mouth curvature and ear height, respectively.
High scores on these two items separate the observation
well to the negative end of the scale and cause the face
to have a big smile and high ears. Items 12 and 4 have the
same sign (negative) but item 9 is associated with a large
positive coefficient and controls the lower eccentricity of
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FEATURE-VARIABLE COMBINATIONS 


THREE DIFFERENT TRIALS


                                        CONTROLLED BY
     FEATURE                 13 VAR       6 VAR        8 VAR
  1  FACE WIDTH              5            [illegible]  [illegible]
  2  ANGLE THETA             4            [illegible]  4
  3  FACE HEIGHT             [illegible]  [illegible]  [illegible]
  4  UPPER ECCENTRICITY      8            [illegible]  [illegible]
  5  LOWER ECCENTRICITY      9            4            [illegible]
  6  NOSE LENGTH             10           0.45         9
  7  MOUTH CENTER            [illegible]  [illegible]  [illegible]
  8  MOUTH CURVATURE         12           2            [illegible]
  9  MOUTH LENGTH            13           [illegible]  [illegible]
 10  EYE HEIGHT              [illegible]  [illegible]  [illegible]
 11  EYE SEPARATION          1            0.5          [illegible]
 12  EYE SLANT               11           8            [illegible]
 13  EYE ECCENTRICITY        3            0.6          [illegible]
 14  EYE HALF LENGTH         6            5            5
 15  PUPIL POSITION          2            [illegible]  [illegible]
 16  EYEBROW HEIGHT          [illegible]  [illegible]  [illegible]
 17  EYEBROW ANGLE           [illegible]  8            8
 18  EYEBROW LENGTH          0.4          0.4          0.4
 19  EAR DIAMETER            [illegible]  [illegible]  [illegible]
 20  NOSE WIDTH              7            [illegible]  [illegible]


Integer numbers are the SOF item numbers.
Decimal values are the fixed features.


TABLE 6

the face. A low mark on this item would complement high
marks on items 12 and 4 and would be reflected in the lower
face having small eccentricity (more narrow).

Turning to the vertical axis (second dimension) of
figure 8, which accounts for 21 percent of the total
variance, the dominant variables are 3, 8, and 11, where 11
is negative and 3 and 8 are positive. High scores on items
3 and 8 separate the observation upward on the vertical axis
and are reflected as highly eccentric eyes and upper ellipse.
Droopy eyes, reflecting a small value for SOF item 11,
tend to complement and reinforce the higher values for
items 3 and 8. It seems like a good idea to use
the results of the discriminant analysis in this way, but
it is impossible for the viewer to know which variables act
together and which interfere unless he is told beforehand.

A good deal of exploratory work was carried out to 
determine useful ranges for the features. The more the 
features are allowed to vary, the wider the variety of 
faces produced. With large ranges, however, faces formed 
from extreme data can become very distorted. On the other 
extreme, too little variability in the ranges suppresses 
valuable information and hinders the clustering process. 

It appears that the best ranges depend on the structure
and amount of variability in the data. Every data set has
its own characteristics, and it is best to tailor the
ranges to each data set. A great portion of the SOF
data is found "close" to the grand mean, but with a few
significant outliers. In order to provide discriminating
ability among the largest mass of the data, the appearance
of the outliers was further accentuated. The ranges were
set at values which would allow close-in discrimination,
but simultaneously attempted to minimize the departure of
the outliers.


C. CLUSTERING THE FACES

The next step in this research was to cluster the faces.
This task was performed by six students in the Operations
Research curriculum. The faces are shown in figure 11; they
represent the 33 course segments from the Electrical
Engineering department in quarter 781. The judges (students
performing the clustering) were given no information
concerning the feature-variable combinations. They were
simply instructed to group the faces in the manner which
best suited them. Fifteen minutes were allowed for the task.
The purpose was to cluster the faces quickly, but carefully.
The judges were reminded that each face is different and
were told to search for the most natural looking clusters.
It was felt that too much time spent on this task could
defeat the purpose of the faces as a first pass look at the
data. In every case, the judges acted independently of one
another. No clues were provided which might have indicated
which features were more important.

Figure 11 shows the faces in the clusters which were
formed by the MIKCA algorithm. This cluster structure was
used as a standard against which the judges' results were
compared. Table 7 shows the results of this experiment.
There was considerable agreement among the judges, as
indicated by the comparison coefficients. There was also
a good deal of similarity between the clusters formed by
the judges and those formed by the MIKCA algorithm.

Figure 11
CLUSTERS DETERMINED BY MIKCA

Several comments by the judges indicated the difficul- 
ties they encountered. The most prevalent comment was the 
difficulty in deciding which feature to consider the most 
important. One judge considered the mouth first in every 
case while another judge used the slant of the eyes as
a more important variable. The judges also indicated that 
trying to evaluate simultaneously differences in many 
features was quite difficult. It is interesting to note 
that the judges' results were quite similar despite the 
fact that different criteria were employed as they formed 
the clusters. 

The SOF identification numbers have been altered for
this report. There were two course segments which
erroneously reported the same SOF number (see face 150). As
one looks at the faces with the discriminant space in mind,
it is much easier to form a clustering which is similar to
the MIKCA solution. One would be aware, for example, that
the position of the pupils is critical in that it can
diminish the effect of the smile and impact heavily on the
horizontal dimension. This effect can be seen by referring
to faces 139 and 140; they are included in a group whose
smiles are not as large as their own because the position of
their pupils (positive to the right) has diminished the
impact of the curvature of the mouth.

Another example of the interaction of variables is
seen by referring to face 152. Most judges would quickly
include this face with the group of four at the top of
figure 11. A subtle difference, however, is the ear
position. Reference to the discriminant function
coefficients will indicate a negative coefficient which has
offset the slant of the eyes in the second dimension.

Knowledge of the discriminant functions helps alleviate 
the confusion which sets in when attempting to cluster. It 
is especially true in this case where so little difference 
exists between the majority of the faces in the middle groups. 

Difficulty in evaluating all 13 features simultaneously
was a problem. As an alternative to this set of faces, two
other sets were produced, one with only six variable
features and the other with eight. Figure 12 contains samples
from these sets: 12a from the six-variable set and 12b from
the eight-variable set. Of course, not all of the data is
represented in this manner. Only those variables which loaded
heavily in the discriminant analysis were used, and the
features controlled by those variables are the ones
considered to convey the most information. Table 6 gives the
complete feature-variable combinations used in the
construction of all sets of faces. The data used in
constructing the set of 33 faces is found in Appendix D.

D. PROBLEMS ENCOUNTERED

The last section addressed difficulties faced by the
judges because it was impossible for them to be aware of
the information contained in the discriminant analysis.
Of course, it could be pointed out that there is little
reason to use this particular MIKCA solution as the
standard, but it does serve as an objective one, since it
was desired to compare the machine results with the human
results. This section addresses problems of a more
mechanical nature.

Exploratory work with the faces uncovered quite a
number of relationships between the features. The
existence of geometric dependencies (not discriminant-type
effects) between features caused difficulties in clearly
displaying the variables. Two notable examples are
mentioned here.

The length of the mouth is quite dependent upon its
curvature. The projection on the horizontal axis (no
relation to discriminant axis) has half-length

$$a_m = X_9 \left( h / |X_8| \right)$$

where X8 is the mouth curvature and X9 is the mouth length
parameter. The variables which control these features are
thus automatically forced into this dependent status
regardless of their true relationship.
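The forced dependence can be made concrete with a small sketch. It assumes the half-length rule quoted from Chernoff's report in Appendix C, in which the mouth is also capped by the half-width of the face at mouth height. The function name and the argument `half_width_at_mouth` are illustrative and are not taken from the face-drawing program.

```python
def mouth_half_length(x8, x9, h, half_width_at_mouth):
    """Half-length of the mouth's projection on the horizontal axis.

    x8: mouth curvature (must be nonzero); x9: mouth length parameter;
    h: distance from the origin to the top of the head;
    half_width_at_mouth: half-width of the face at mouth height (assumed name).
    """
    radius = h / abs(x8)  # radius of the circle that carries the mouth arc
    # The arc can be no wider than its own circle or than the face.
    return x9 * min(radius, half_width_at_mouth)


# Same length parameter, different curvature: the drawn length changes.
flat = mouth_half_length(x8=0.5, x9=0.5, h=1.0, half_width_at_mouth=0.8)
curved = mouth_half_length(x8=5.0, x9=0.5, h=1.0, half_width_at_mouth=0.8)
```

With these made-up values the strongly curved mouth is drawn much shorter than the flat one, even though the length parameter is identical in both calls.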
The other example concerns the ellipses forming the
facial boundary and the angle theta. The upper ellipse is
drawn through the points P', U, and P; the lower through
P', L, and P (see figure 10). Two faces with identical
values for the ellipses might have quite different appearing
facial boundaries due to the dependence on theta for the
points P' and P. This is another example of forced dependence.

In order for the width and height of the face to meet a 
specified constant, the program "normalizes" both horizontal 
and vertical axes. This normalization eliminates the 
effects of X1 and X3, and it adjusts all of the features
during the process. It is believed to be this normalization 
process which causes faces which are growing wider and wider 
to suddenly revert to one-half the widest width when the 
width exceeds a threshold value. A similar phenomenon is 
experienced in the height variable. This half-size adjust- 
ment may be seen in figure 12b. Face 132 has been changed 
by a disproportionate amount due to the normalization pro- 
cess. It is of interest to point out that the face-width 
feature was being held constant during the construction of 
that set of faces. 

Yet another hidden dependency is that of the eye height
on the nose length. The eyes are located at height

$$y_e = h \left[ X_{10} + (1 - X_{10}) X_6 \right]$$

where X6 controls the length of the nose.
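A minimal numerical sketch of this dependency follows, assuming the eye-height expression given in Appendix C. The function name and the sample values are illustrative only.

```python
def eye_height(x6, x10, h=1.0):
    """Height of the eye centers above the origin (Appendix C expression)."""
    return h * (x10 + (1.0 - x10) * x6)


# Holding the eye-height parameter x10 fixed, a longer nose (larger x6)
# still pushes the eyes upward.
short_nose = eye_height(x6=0.2, x10=0.5)
long_nose = eye_height(x6=0.6, x10=0.5)
```

A variable assigned to nose length therefore also moves the eyes, whether or not the two underlying SOF items are related.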


These and other subtle dependencies mislead the
investigator if he is not aware of their existence. These
problems reduce the ability of the faces to effectively
display the full 20 dimensions. Unfortunately, these points
were not explicit in the original document [11] and their
discovery was an 11-th hour surprise. It was not possible
to account for them or to uncover all such relationships at
this writing. Appendix C gives a complete listing of the
formulae used in the construction of the faces.

A. BACKGROUND AND ALTERNATIVES

Repeated use of the comparison coefficient has been 
made in this study. The present chapter is devoted to an 
explanation of this measure of association. The method 
should be flexible enough to handle multiple comparisons 
simultaneously, thus enabling one to measure the overall
agreement of several judges. 

It was decided that the best way to display the agreement
of two judges was through the use of a contingency table.
Table 8 is an example of one to be used in the discussion.

TABLE 8 (contingency table, Judge X versus Judge Y)

The contingency table indicates the agreement of the two
judges. The purpose of this chapter is to find a measure
which evaluates how close this agreement is. Note that
judge X categorized the observations into three clusters
with 6, 3, and 4 elements, respectively. Of the 13
observations, judge Y placed six in one group and seven in
another. The labeling of the clusters is arbitrary. The
upper left entry in the table indicates that five of the
objects in judge X's category A matched with five of judge
Y's category A. The entire table is interpreted in this
manner. Notice that if one chooses to call this entry of
five as representing agreement, then the entry of 1 below it
and the 1 in the top right corner must represent some of the
observations on which the judges disagreed.

The contingency table idea is easily generalized to 
higher dimensions (more than two judges). In three dimen- 
sions, a box (or cube) would represent the table, with 
elements internal to the box measuring agreement between 
three judges. 

One method for measuring the degree of agreement is to
find the combination of entries with the largest sum such
that only one per row and one per column is chosen. This
task becomes very difficult as the number of clusters
increases, but it can be solved through the use of linear
programming techniques. It is a constrained optimization
with a linear objective function and is an application of
the "assignment problem." Unfortunately, when generalizing
to higher dimensions, the L.P. loses its unimodularity
attribute and the number of constraints and variables in
the problem becomes prohibitively large.
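For a two-judge table with only a few clusters, the matching can be found by brute force rather than by linear programming. The sketch below illustrates the idea only; it is not the procedure used in this study. The function name and the sample table are hypothetical, and enumerating permutations is practical only for a handful of clusters.

```python
from itertools import permutations


def best_matching(table):
    """Return (total, pairs) for the largest one-per-row, one-per-column sum."""
    n_rows, n_cols = len(table), len(table[0])
    size = max(n_rows, n_cols)
    # Pad with zeros to a square table so every row can be matched.
    square = [
        [table[r][c] if r < n_rows and c < n_cols else 0 for c in range(size)]
        for r in range(size)
    ]
    best_total, best_perm = -1, None
    for perm in permutations(range(size)):
        total = sum(square[r][perm[r]] for r in range(size))
        if total > best_total:
            best_total, best_perm = total, perm
    # Drop the matches that involve padded rows or columns.
    pairs = [(r, c) for r, c in enumerate(best_perm) if r < n_rows and c < n_cols]
    return best_total, pairs


# Hypothetical 2 x 3 contingency table for two judges.
total, pairs = best_matching([[5, 0, 1], [1, 3, 3]])
```

The number of permutations grows factorially with the number of clusters, which is one way to see why the text turns to linear programming and, for more than two judges, abandons this approach.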

The Chi-square contingency statistic was considered
inappropriate because, when using the smaller sample sizes,
more than 20 percent of the cells have expected frequencies
of less than five (see ref. 14). Even when using the
190 element sample, there were frequent occasions when this
same difficulty persisted. The Chi-square statistic was
not used since it could not have been applied consistently
throughout the analysis.
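The rule of thumb cited above is easy to check for any given table. The following sketch computes the usual expected frequencies (row total times column total, divided by the grand total) and reports the fraction of cells falling below five. The function name and the sample tables are hypothetical.

```python
def small_expected_fraction(table, threshold=5.0):
    """Fraction of cells whose expected frequency is below the threshold."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence for every (row, column) cell.
    expected = [r * c / n for r in row_totals for c in col_totals]
    small = sum(1 for e in expected if e < threshold)
    return small / len(expected)


# A small table of 13 observations versus a large, evenly filled one.
sparse = small_expected_fraction([[5, 0, 1], [1, 3, 3]])
dense = small_expected_fraction([[50, 50], [50, 50]])
```

A table would fail the rule whenever the returned fraction exceeds 0.2, which is what happens for the 13-observation table in this example.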

Professor James Hartman provided an idea that led to
the method finally put into use.


B. THE TECHNIQUE

The idea was to sum the squares of the entries in the
contingency table and then "normalize" this quantity.
Summing squares offers an excellent method for measuring
the degree of association; however, the following example
illustrates the need for some sort of adjustment factor.

TABLE 9 (contingency tables 9a and 9b)

Both tables represent perfect agreement on 20 observations;
however, table 9a has a sum of squares equal to 200 and 9b
has a value of 362. It is desired to indicate both of
these examples as perfect agreement with one being no
better than the other. Hence, it became necessary to
determine the "best possible" sum of squares in every given
situation. A computer program was written for this purpose
and is included in Appendix F. The statistic which is
used as a comparison coefficient is a number between zero
and one, formed as the ratio of the actual sum of squares
to the "best possible" sum of squares. The best possible
sum of squares is computed using a minimax approach and
is based on the number of judges, the number of clusters
formed by each judge, and the number of observations within
each cluster. The minimax procedure does not need to know
which observations make up a cluster, only how many
observations. An example showing the computation of the
comparison coefficient is given in Appendix E.
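The procedure can be sketched in a few lines. This is not the program of Appendix F; it is a reading of the worked example in Appendix E, in which the smallest of the judges' largest remaining cluster sizes (the minimax) is repeatedly recorded and subtracted from each judge's largest remaining cluster. The function names and the label lists are hypothetical.

```python
from collections import Counter


def best_possible_sum_of_squares(sizes_per_judge):
    """Greedy minimax bound computed from cluster sizes alone."""
    remaining = [list(sizes) for sizes in sizes_per_judge]
    if len({sum(sizes) for sizes in remaining}) != 1:
        raise ValueError("every judge must classify the same number of observations")
    total = 0
    while any(max(sizes) > 0 for sizes in remaining):
        minimax = min(max(sizes) for sizes in remaining)
        total += minimax ** 2
        for sizes in remaining:
            # Subtract the minimax from this judge's largest remaining cluster.
            sizes[sizes.index(max(sizes))] -= minimax
    return total


def comparison_coefficient(labelings):
    """Actual over best possible sum of squares for any number of judges."""
    cells = Counter(zip(*labelings))  # cells of the k-way contingency table
    actual = sum(count ** 2 for count in cells.values())
    sizes = [list(Counter(labels).values()) for labels in labelings]
    return actual / best_possible_sum_of_squares(sizes)


# Cluster sizes as we read them from the worked example in Appendix E.
best = best_possible_sum_of_squares([[3, 6, 3, 8], [5, 5, 8, 2]])

# Two made-up judges in perfect agreement up to relabeling.
perfect = comparison_coefficient([list("AABBB"), list("YYXXX")])
```

Because each judge's labels enter only through counts, the same function accepts three or more judges, which is the flexibility asked for at the start of this chapter. For perfect agreement with cluster sizes (10, 10) or (19, 1) the bound equals 200 and 362 respectively, so both cases score exactly one.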

This method for measuring the degree of agreement 
provides the analyst a standard scale upon which to compare 
coefficients based on solutions involving varying numbers 
of clusters and cluster memberships, as well as varying 
numbers of judges. 

In order to provide some feel for the significance
of this measure, several cluster solutions were
formed wholly at random and compared to results produced
by MIKCA and the judges. In every case, the values of the
comparison coefficients were less than 0.1.

VIII. SUMMARY AND CONCLUSIONS


This research has been largely exploratory. A path 
has been paved for others to follow in examining the SOF 
data. The theory of cluster analysis and its relationship 
to discriminant theory have been carefully examined with 
emphasis on two widely divergent techniques. In the analysis
of the data, attempts have been made to identify the under- 
lying structure of which the clusters are a consequence. 
This chapter is devoted to separate discussions of the 
cluster methodology explored and the interpretation of 
the SOF data. 

Although the development of methodology phase of the 
research was carried to completion in a general sense, a 
number of problems were encountered along the way. Many 
of these problems are deserving of deeper treatment and 
are discussed below. 

The data transformation used was the best of the three
considered. It produced the smallest test statistic for
homogeneity of covariances, but the value itself was not in
the acceptable range, based on normal theory. It should
be possible to improve the choice.

The modification of MIKCA to allow for weighting of
the input vectors has been effected and well tested. It
is an important added capability for this program.

The use of discriminant analysis to discover the
important variables affecting the clustering is, no doubt,
not new. It needs some refinement, however, because it is
not clear how one should rate the importance of variables
supporting the first dimension relative to those supporting
the second (or any other dimension). Such a set of
priorities could be most useful.

The idea of using the important variables (and their
signs) of the discriminant functions in the problem of
assigning sets of variables to sets of features is believed
to be new. It may have great potential in providing a way
for the Chernoff face technique to replace the more expensive
technique based on computer iteration.

The present attempt to work with the faces was disappointing.
This is due largely to the fact that certain restrictions,
truncations, and discontinuities in the movement of the
features were not well documented in our sources. Their
discovery came as a surprise late in the research and it
was not feasible to go back and readjust. Such readjustment
is clearly called for and would require a substantial effort
in future studies.

The coefficient of comparison was a new idea and there
was insufficient time to explore its properties. What is
needed is more investigation in order to interpret its
various values (or another measure whose values are
interpretable). The comparison measure is also useful in the
problem of assigning variables to features when working with
faces. The goal is to choose assignments having the property
that the judges are in good agreement when forming the
clusters.

The study of the SOF data which was made while developing 
the cluster methods produced results about student evalua- 
tions of courses (and instructors). The results are 
discussed below. 

A principal components analysis of the data swarm of
mean vectors showed it to be essentially one dimensional,
lying along the main diagonal of 13-space. The
interpretation of this is that all 13 items are equally
important in the students' rating of the course
and its instructor. On the other hand, this same effect
would be produced by careless, perfunctory completion of
the forms by many students.

The partitioning of the data into three or four clusters 
by MIKCA is more or less successful. The clusters are not 
sharply separated (there are no great voids between them). 
Study should be made to see how much the density of the 
data diminishes near the boundaries of the partitions. 

Although the main data swarm is essentially one
dimensional, it appears useful to use two dimensions to
describe the individual partitions after clustering. In
doing this, variable 12 (overall rating of the instructor)
emerged as most important in the first dimension, with
variables 3, 8, and 11 giving support in the second. Only
one discriminant study is reported here, although several
were performed. Variable 12 appears to have permanence while
the other variables often shift in importance.

The cluster profiles which track the cluster centroids 
over the 13 variables provide a set of (almost) horizontal 
lines. This supports further the one dimensional interpre- 
tation of the data swarm. The result is not sensitive to 
whether or not the data are standardized. 

The results of applying the modified MIKCA did not vary 
greatly from the results of applying the original MIKCA. 
Hence the number and composition of the clusters are not
disturbed much by the variability in class size.

The relative position of the clusters is strongly and 
inversely related to class size. The courses that receive 
uniformly high ratings are associated with the small class 
sizes and the courses receiving uniformly poor ratings are 
associated with the large class sizes. 

All judges reported use of a hierarchical approach to 
separate the faces into clusters. Most judges first separated 
the faces into two groups according to the curvature of the 
mouth (smile or frown). There was little agreement about 
which features were important in further subdividing the 
two main groups, hence some disparity resulted in their 
final cluster solutions. 

The MIKCA procedure is a sophisticated approach to cluster
analysis; its results are based on sound statistical theory.
The modified version of that computer program is considered
particularly well suited to the SOF data or any other data
set possessing the same predetermined class structure. The
impact of class size on cluster membership has been
emphasized. This important issue may indicate that the
smaller classes receive artificially inflated SOF scores.
Those who use these scores as a means for evaluating teacher
performance surely must give consideration to this fact.

APPENDIX A
Test for the Equality of Dispersion Matrices of k Groups 


Given a sample of k groups and m variables with group
dispersion matrices $S_g$ $(g = 1, \ldots, k)$, pooled
within-groups dispersion matrix $S_0$, and total sample
observations

$$N = \sum_{g=1}^{k} N_g ,$$

Box shows that the hypothesis

$$H: \Sigma_1 = \Sigma_2 = \cdots = \Sigma_k$$

may be tested by an F statistic developed from

$$A = (N - k) \ln |S_0| - \sum_{g=1}^{k} (N_g - 1) \ln |S_g|$$

$$B = \left[ \sum_{g=1}^{k} \frac{1}{N_g - 1} - \frac{1}{N - k} \right] \frac{2m^2 + 3m - 1}{6 (m + 1)(k - 1)}$$

$$C = \left[ \sum_{g=1}^{k} \frac{1}{(N_g - 1)^2} - \frac{1}{(N - k)^2} \right] \frac{(m - 1)(m + 2)}{6 (k - 1)}$$

$$D = \frac{(k - 1)\, m\, (m + 1)}{2} , \qquad E = \frac{D + 2}{|C - B^2|}$$

If $B^2 > C$, the test statistic is

$$F = \frac{E \, A \, (1 - B + 2/E)}{D \left[ E - A \, (1 - B + 2/E) \right]} \sim F_{D,E}$$

If $C > B^2$, the test statistic is

$$F = \frac{A}{D} \left( 1 - B - D/E \right) \sim F_{D,E}$$

APPENDIX C
The following information is taken directly
from Chernoff's technical report [9].

Construction of Faces

Given 18 numbers $(x_1, x_2, \ldots, x_{18})$ in appropriate ranges
(which will usually be 0 to 1), we define a face (see
Fig. 3) as follows. Let H be a nominal distance and let
$h^* = \tfrac{1}{2}(1 + x_1) H$ be the distance from the origin O to a
"corner" point P. As $x_1$ varies from 0 to 1, $h^*$ varies from
H/2 to H. Let $\theta^* = (2x_2 - 1)\pi/4$ be the angle of OP with the
horizontal. Let P' be a point symmetric to P about the
vertical axis through O. Let $h = \tfrac{1}{2}(1 + x_3) H$ represent the
distance from O to U the top of the head and L the bottom of
the head, both on the vertical line through O. The upper part
of the head is an ellipse which is determined by P', U,
and P and an eccentricity $x_4$. Let $x_4$ represent the ratio
of the width to height of the upper ellipse. Similarly,
$x_5$ is the same ratio for the ellipse through P', L, and P.
The nose is a vertical line of length $2hx_6$ with O as center.
The mouth intersects the vertical line extended through the
nose at a point $P_m$ whose distance below O is
$h[x_7 + (1 - x_7) x_6]$. This represents a point $x_7$ part of
the way from the bottom of the nose to L. The mouth is part
of a circle whose center is $h/x_8$ above $P_m$. Thus a positive
value of $x_8$ yields a smile. The mouth is symmetric about the
vertical axis through O. Its projection on the horizontal
axis has the half-length $a_m = x_9 (h/|x_8|)$ unless $(h/|x_8|)$
exceeds the half-width $w_m$ of the face at the height of $P_m$;
in that case $x_9 w_m$ is used. The eyes are located at a height
$y_e = h[x_{10} + (1 - x_{10}) x_6]$ above O and at centers which
are $x_e = w_e (1 + 2x_{11})/4$ from the vertical axis, where $w_e$
is the half-width of the face at the height $y_e$. They are
symmetrically slanted at an angle $\theta = (2x_{12} - 1)\pi/5$ with
the horizontal. The eyes are ellipses with eccentricity
$x_{13}$ (height/length before slanting) and half-length
$L_e = x_{14} \min(x_e, w_e - x_e)$. The only asymmetry appears
in the location of the pupils, which move together an amount
$r_e (2x_{15} - 1)$ from the center of the eye, where
$r_e = (\cos^2\theta + \sin^2\theta / x_{13}^2)^{-1/2} L_e$ is the
horizontal half-length of the slanted eye at height $y_e$.
Finally the eyebrows are symmetrically located with
centers at a height $y_b = 2(x_{16} + .3) L_e x_{13}$ above the eye
centers and slant $(2x_{17} - 1)\pi/5$ with respect to the eye, i.e.,
$\theta^{**} = \theta + (2x_{17} - 1)\pi/5$ with respect to the horizontal,
and half-length $L_b = r_e (2x_{18} + 1)/2$.


One final step taken by the programmer and which has
been left intact, is to normalize both horizontal and
vertical axes, each by a multiplicative factor, so that
the width of the head at its widest part and its height
are both equal to a specified constant. This step, which
essentially removes two degrees of freedom, was left
unaltered for intuitive and aesthetic reasons that are
somewhat vague and may require reconsideration when dealing
with 18-dimensional data. In the meantime, the effects
of $x_1$ and $x_3$ are almost but not completely eliminated
because of the secondary effects of the normalization,
which will adjust all of the other features at the same
time as the width and height are normalized.

Most of the parameters $x_i$ are adjusted to range within
a subinterval of (0,1). The exceptions are two of the
eccentricities, $x_4$ and $x_5$, and the parameter controlling
curvature of the mouth, $x_8$. Ordinarily, $x_4$ and $x_5$ are kept
within 1/2 to 2, and $x_8$ is kept within (-5,5). The
eccentricity of the eye $x_{13}$ has usually been kept within
(.4,.8). Some of the ranges must be controlled carefully. We
do not want negative length eyes. Others need not be so
carefully controlled. It is no calamity to have eyes extend
beyond the face.

When the two ellipses of the head meet smoothly, the
corner point P is lost, and the variable $x_2$ loses effect.
Restricting $x_4$ and $x_5$ to widely separated ranges seems to
avoid this problem.

Data are converted to the x parameters as follows. If
the variable Z is used to control the parameter $x_i$, which
is to be allowed to range from $a_i$ to $b_i$, we let

$$x_i = a_i + (b_i - a_i) \frac{Z - m}{M - m}$$

where m and M are the observed minimum and maximum of Z.
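The conversion can be sketched as follows, under the assumption that the mapping is a straight linear rescaling of the observed range of Z onto the allowed range of the parameter. The function name and the sample scores are illustrative and do not come from the SOF data.

```python
def to_parameter(z, z_min, z_max, a, b):
    """Map a data value z in [z_min, z_max] linearly onto [a, b]."""
    return a + (b - a) * (z - z_min) / (z_max - z_min)


# Made-up scores for one SOF item, mapped onto a feature range of (0.2, 0.8).
scores = [1.0, 3.0, 5.0]
low, high = min(scores), max(scores)
params = [to_parameter(z, low, high, a=0.2, b=0.8) for z in scores]
```

Narrowing the interval from a to b compresses the feature toward a fixed value, which is the trade-off between distortion and lost information discussed in the section on useful ranges.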
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re 


Formulae Used in the Construction


We describe a few of the less trivial formulae used
in the construction of the faces.

The point P has coordinates $x_0 = h^* \cos\theta^*$ and
$y_0 = h^* \sin\theta^*$.

The ellipse through PUP' has equation

$$\frac{x^2}{a_u^2} + \frac{(y - c_u)^2}{b_u^2} = 1$$

where

$$b_u = h - c_u , \qquad a_u = x_4 b_u ,$$

and

$$c_u = \frac{1}{2} \left[ (h + y_0) - \frac{x_0^2}{x_4^2 (h - y_0)} \right].$$

The ellipse through PLP' has equation

$$\frac{x^2}{a_l^2} + \frac{(y - c_l)^2}{b_l^2} = 1$$

where

$$b_l = h + c_l , \qquad a_l = x_5 b_l ,$$

and

$$c_l = \frac{1}{2} \left[ (-h + y_0) - \frac{x_0^2}{x_5^2 (-h - y_0)} \right].$$

The head is then described by $(\pm x(y), y)$ where

$$x(y) = \begin{cases} x_4 \left[ b_u^2 - (y - c_u)^2 \right]^{1/2} , & y \ge y_0 \\ x_5 \left[ b_l^2 - (y - c_l)^2 \right]^{1/2} , & y \le y_0 \end{cases}$$

The mouth is a circular arc with curvature $|x_8/h|$
through $(0, y_m)$ where $y_m = -h (x_7 + (1 - x_7) x_6)$. It is
described by

$$y = y_m + (\operatorname{sgn} x_8) \left[ \frac{h}{|x_8|} - \left( \frac{h^2}{x_8^2} - x^2 \right)^{1/2} \right] , \qquad 0 \le x \le a_m ,$$

where

$$a_m = x_9 \min \left[ x(y_m), \; h/|x_8| \right].$$

The eyes are nominally centered at $(x_e, y_e)$ where

$$y_e = h \left[ x_{10} + (1 - x_{10}) x_6 \right]$$

$$x_e = x(y_e) \left[ 1 + 2x_{11} \right] / 4$$

and have half-length

$$L_e = x_{14} \min \left[ x_e, \; x(y_e) - x_e \right].$$

Let (u,v) be the coordinates of an ellipse with center at
the origin, half-length $L_e$ and eccentricity $x_{13}$. Then

$$v = x_{13} \left( L_e^2 - u^2 \right)^{1/2}$$

describes part of the ellipse. A similar part of the slanted
eye can be described for $|u| \le L_e$ by

$$x = x_e + u \cos\theta - v \sin\theta$$

$$y = y_e + u \sin\theta + v \cos\theta$$

and symmetry is used to complete both eyes.

To place the pupils within the eyes, both are moved
a distance $r_e (2x_{15} - 1)$ from the center of the eye, where
$r_e$, the horizontal half-length of the slanted eye at
height $y_e$, is $(u^2 + v^2)^{1/2}$ when $v/u = \tan\theta$. This yields

$$r_e = L_e \left[ \cos^2\theta + \sin^2\theta / x_{13}^2 \right]^{-1/2}.$$

The program then normalizes all heights and widths by
multiplicative factors k/h and k/max x(y) respectively.
Currently k is set at 2 inches.

APPENDIX D


These are the 33 observations (transformed data)
from the Electrical Engineering Department.
There are two rows of data per SOF number.
The first 13 entries are SOF item scores;
the 14-th is class size.

APPENDIX E

An Example of the Comparison Coefficient


Given two judges who cluster 20 observations (numbered
1 through 20) into groups as shown below:


Judge X Judge Y 
@iuster lL AES ee ae | cluster 1 SO 7 
Cluster 2 bo, tl 2,13 cluster 2 ieee a 9S 
Cluster 3 ae Ol d.5, cluster 3 oie | am 2) 
iG, 17,18 
Cluster 4 1L9A20 cluster 4 LOwlk sie | Agee.) 7 » 
VS 0 


The contingency table appears below with marginal (row)
totals.
Judge X 
1 2 3 4 
i 3 
Judge 2 : 
y 3 5) 
4 8 
5. 5 8 yi 20 


Step 1: Find the sum of squares of the table entries.

$$1 + 4 + 16 + 4 + 4 + 1 + 1 + 25 + 4 = 60$$








Step 2: Find best possible sum of squares.

The minimax is the minimum of the two judges' maxima.
Subtract the minimax from the max element of each judge
and repeat.

      # obs in clusters
  Judge X       Judge Y         Max      Minimax
  3  6  3  8    5  5  8  2      8, 8        8
  3  6  3  0    5  5  0  2      6, 5        5
  3  1  3  0    0  5  0  2      3, 5        3
  0  1  3  0    0  2  0  2      3, 2        2
  0  1  1  0    0  0  0  2      1, 2        1
  0  0  1  0    0  0  0  1      1, 1        1
  0  0  0  0    0  0  0  0      Finished Step 2





Step 3: Sum the squares of the minimaxes.

$$64 + 25 + 9 + 4 + 1 + 1 = 104$$

Best possible sum of squares = 104

$$\text{Comparison coefficient} = \frac{\text{Actual sum of squares}}{\text{Best possible sum of squares}} = \frac{60}{104} = 0.58$$